Perly Consulting │ Beck Eco

The State of Play

A living index of AI adoption across industries — where established practice meets the bleeding edge
UPDATED DAILY

The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.

The Daily Dispatch

A daily newsletter distilling the past two weeks of movement in a domain or two — delivered to your inbox while the index updates in the background.

AI Maturity by Domain

Each dot marks the weighted maturity of practices within a domain.

[Chart: domains plotted on an axis from Bleeding Edge to Established]

API & schema generation from natural language

BLEEDING EDGE

TRAJECTORY

Advancing

AI generating API endpoints, database schemas, or data models from natural language descriptions of requirements. Includes REST/GraphQL API scaffolding and database schema design; distinct from infrastructure-as-code which targets deployment resources rather than application interfaces.

OVERVIEW

Generating API endpoints and database schemas from natural language remains an experimental capability despite rapid vendor investment and visible production deployments. The promise is clear: describe what you need and get a working REST endpoint, GraphQL schema, or SQL query. In production, however, a persistent bench-to-production gap appears. Benchmark accuracy on NL-to-SQL has climbed sharply, yet frontier models that achieve 85-90% accuracy on curated benchmarks (Spider) collapse to 10-29% on production-scale schemas (BEAVER, BIRD-Interact, enterprise deployments). The reliability gap stems from four distinct failure layers: syntax (mostly solved), schema compliance (partially addressed by constrained decoding), semantic validity (unresolved), and distribution shift (unresolved). Production deployments exist (Uber's QueryGPT processing 1.2M queries/month, AutoBE reporting 85-90% success rates, AWS and Spotify tier-1 integrations), but they succeed only under heavy guardrails (metadata enrichment, domain-specific fine-tuning, intent classification layers, and test-driven validation pipelines) rather than through zero-shot generation. Developer trust remains low: surveys show 90% of developers use these tools, yet only 3% report high trust, and two-thirds of outputs require substantial modification. The defining tension for this bleeding-edge practice is that the fundamental bottleneck is semantic understanding (business logic, metric definitions, implicit relationships) and operational maturity (validation, governance, schema drift management), not schema formatting capability. Until semantic understanding and governance infrastructure close the gap, adoption stays confined to prototyping, legacy bridging, and low-stakes query generation.

CURRENT LANDSCAPE

Late May 2026 evidence confirms that the bench-to-production gap is the persistent critical barrier across independent assessments. MIT/Intel/Harvard's BEAVER benchmark (May 2026) gives the starkest signal: frontier models achieve 10.8% accuracy on proprietary real-world enterprise schemas despite 82% on academic Spider, a failure rate of roughly 90% that shows the gap is fundamental, not marginal. Real-world deployments document the production formula: Uber's QueryGPT (1.2M queries/month) succeeded through 20+ iterations of intent classification, domain-specific workspace clustering, and context limiting, not through better models. Amazon Science's SQL-Trail shows that multi-turn reinforcement learning with iterative feedback, rather than scale, enables 7B/14B models to outperform substantially larger systems by 5%. AutoBE, a production tool, reports 85-90% success rates on real examples via AST-based validation and a 100% compilation guarantee, with explicit limitations on runtime behavior that requires human review.
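To make the "context limiting" idea above concrete, here is a minimal sketch that passes only the most relevant tables to a model instead of the whole schema. It assumes a simple keyword-overlap heuristic; the `SCHEMA` catalogue, the function names, and the scoring rule are all hypothetical illustrations and are not taken from QueryGPT or any other system named here.

```python
import re

# Hypothetical schema catalogue: table name -> column names.
SCHEMA = {
    "trips": ["trip_id", "driver_id", "city", "fare", "completed_at"],
    "drivers": ["driver_id", "name", "rating", "city"],
    "invoices": ["invoice_id", "vendor", "amount", "due_date"],
}


def tokenize(text):
    """Lowercase a string and return its alphabetic word tokens as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def select_tables(question, schema, limit=2):
    """Rank tables by word overlap with the question and keep the top `limit`."""
    words = tokenize(question)
    scored = []
    for table, columns in schema.items():
        vocabulary = tokenize(table) | tokenize(" ".join(columns))
        score = len(words & vocabulary)
        if score > 0:  # tables with no overlap are never sent to the model
            scored.append((score, table))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [table for _, table in scored[:limit]]


def build_prompt_context(question, schema, limit=2):
    """Render only the selected tables as compact schema context for a prompt."""
    selected = select_tables(question, schema, limit)
    return "\n".join(f"{t}({', '.join(schema[t])})" for t in selected)


if __name__ == "__main__":
    print(build_prompt_context("average fare per city for completed trips", SCHEMA))
```

A real deployment would likely replace the keyword overlap with embeddings or an intent classifier, but the design point is the same: the model sees a small, relevant slice of the schema rather than everything.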

Vendor ecosystem maturity is evident across tier-1 platforms. AWS Amazon Q Developer reached GA with production deployments at SmugMug and TCS. Spotify's natural language interface for ad management (May 2026) reports 85% adoption but is explicitly deployed as a convenience layer, not as autonomous generation. Schema awareness now separates viable tools from failures: Analytics Insight's 2026 benchmarking shows schema-connected tools achieve 64-90% accuracy, while schema-agnostic approaches invent columns and fail entirely.

The semantic bottleneck and governance requirements dominate production constraints. Bytebase's synthesis of OpenAI, Google Cloud, Vercel, and Hex case studies establishes governance as the success pattern: context limiting, rigorous evaluation, deterministic validation. dpriver identifies 10 production risks (including hallucinated schema, unauthorized access, and PII exposure) that require deterministic validation pipelines, not prompts. Schema drift emerges as a critical blocking issue: silent field changes cause hidden behavioral regression in agents trained on stale schemas (a healthcare case study found drift in 12 of 28 microservices). Governance tools are maturing: constrained decoding frameworks (Guidance, Outlines, XGrammar) proliferate, but JSONSchemaBench reveals significant feature coverage gaps across all frameworks.
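As an illustration of what a deterministic validation step might look like, the sketch below checks a model-generated query against an empty copy of the schema before it ever touches real data. It uses Python's standard `sqlite3` module; the schema, the `validate_sql` name, and the read-only policy are assumptions made for this example, not a description of any vendor's pipeline.

```python
import sqlite3

# Hypothetical schema, used only for illustration.
SCHEMA_DDL = """
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
"""


def validate_sql(candidate, schema_ddl=SCHEMA_DDL):
    """Return (ok, reason) for a generated query, checked against empty tables."""
    statement = candidate.strip().rstrip(";").strip()
    # Crude read-only policy: reject anything that is not a single SELECT.
    # (This also rejects CTEs and literals containing ';', which a real
    # validator would handle with a proper SQL parser.)
    if not statement.lower().startswith("select"):
        return False, "only SELECT statements are allowed"
    if ";" in statement:
        return False, "multiple statements are not allowed"

    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)  # real structure, no data
        # EXPLAIN compiles the query, so a hallucinated table or column
        # raises an error here instead of at run time.
        conn.execute("EXPLAIN " + statement)
    except sqlite3.Error as exc:
        return False, str(exc)
    finally:
        conn.close()
    return True, "ok"


if __name__ == "__main__":
    print(validate_sql("SELECT name, loyalty_tier FROM customers"))
    print(validate_sql("SELECT region, COUNT(*) FROM customers GROUP BY region"))
```

A check like this only covers the syntax and schema-compliance layers. A query can pass it and still answer the wrong question, which is why the semantic layer remains the harder problem.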

Production deployment remains anchored to non-critical use cases: rapid prototyping, legacy API bridging, and exploratory analytics. Practitioners document the reality: 78% zero-shot accuracy means roughly 1 in 5 queries is wrong, often silently. Evaluation practice is maturing: benchmarks report 90%+, but real-world Execution Accuracy drops to 51%, with Snowflake Cortex the exception at >90%, achieved through semantic layers and curated context. Domain-specific, low-resource settings (the production norm) lack annotated training data, creating a chicken-and-egg problem that knowledge distillation frameworks are beginning to address.
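Execution Accuracy, as referenced above, is usually measured by running the predicted query and a reference ("gold") query against the same database and comparing results. The sketch below is one simplified way to compute it with the standard `sqlite3` module; the demo table, helper names, and the order-insensitive comparison are assumptions for illustration, and real benchmarks differ in how they treat ordering and duplicates.

```python
import sqlite3
from collections import Counter


def demo_connection():
    """Build a tiny in-memory database (hypothetical data) to evaluate against."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (order_id INTEGER, region TEXT, total REAL);
        INSERT INTO orders VALUES (1, 'EU', 10.0), (2, 'EU', 5.0), (3, 'US', 7.5);
    """)
    return conn


def run(conn, sql):
    """Execute a query; return rows as an order-insensitive multiset, or None on error."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_accuracy(conn, pairs):
    """Fraction of (gold_sql, predicted_sql) pairs whose result sets match."""
    if not pairs:
        return 0.0
    correct = 0
    for gold_sql, predicted_sql in pairs:
        gold = run(conn, gold_sql)
        predicted = run(conn, predicted_sql)
        if gold is not None and gold == predicted:
            correct += 1
    return correct / len(pairs)


if __name__ == "__main__":
    pairs = [
        # Differently written but equivalent: counted as correct.
        ("SELECT region, SUM(total) FROM orders GROUP BY region",
         "SELECT region, SUM(total) AS revenue FROM orders GROUP BY region ORDER BY region"),
        # Valid SQL, wrong filter: a silent failure.
        ("SELECT COUNT(*) FROM orders WHERE region = 'EU'",
         "SELECT COUNT(*) FROM orders WHERE region = 'US'"),
        # Hallucinated column: fails loudly.
        ("SELECT MAX(total) FROM orders", "SELECT MAX(amount) FROM orders"),
    ]
    print(execution_accuracy(demo_connection(), pairs))
```

The second pair is the case the text calls "often silently" wrong: the query runs without error and returns a plausible number, so only a comparison against a known-good result catches it.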

TIER HISTORY

Research: Mar-2023 → Jul-2023
Bleeding Edge: Jul-2023 → present

EVIDENCE (110)

— Peer-reviewed research: LLMs reliably generate SQL schemas from NL when given schema constraints and structured prompting, no model training required. Confirms schema awareness and guardrails enable production-viable generation.

Text-to-SQL: Comparison of LLM Accuracy [Adoption Metrics]

— Industry benchmark of 34 LLMs on text-to-SQL with error analysis: 20%+ error rates on complex queries; failures stem from incomplete request parsing, hallucinated columns, and constraint mapping failures. Validation essential for production.

— Microsoft GA product feature: Copilot generates PostgreSQL schema modifications and SQL from NL prompts (e.g., 'convert the hr.employees table to use a JSONB column'). Demonstrates production NL-to-schema generation in tier-1 IDE.

— Production guide for AI-generated FastAPI code and Pydantic schemas from NL specifications. Model analysis: Claude Sonnet excels at async patterns; ChatGPT generates deprecated Pydantic v1 syntax. Explicit prompt guidance prevents AI fallback to v1.

— CRITICAL SIGNAL: Agentic code generation loses 30+ points in assertion pass rates as structural constraints accumulate. LLM agents pass unit tests but violate runtime ORM contracts; constraint ceiling exists, not graceful degradation.

— Formal framework distinguishing technical debt from recurring stochastic tax in probabilistic agentic workflows. Quantifies operational cost structure for API/schema generation systems, identifying tool/schema debt and governance debt vectors.

— Uber's QueryGPT case study: 1.2M queries/month processed, reduced query authoring from 10 minutes to 3 minutes through intent agents and domain-specific workspace clustering, demonstrating production evolution and schema scaling challenges.

— Production tool generating complete backends (Prisma schema, OpenAPI specs, NestJS) from conversational requirements; 40+ specialized agents, 100% compilation guarantee, 85-90% success rates on real-world examples.

HISTORY

  • 2023-H1: Research advances in schema understanding and text-to-SQL, with foundational benchmarks (BIRD) revealing significant accuracy gaps (40% vs 92% human). Early implementations in academic (DBCopilot) and vendor (Postgres/GPT-3) projects. Deployment limited to research and proof-of-concept stages.
  • 2023-H2: Vendor tooling and patent filings accelerate. GraphQL Editor deploys AI-powered schema generation from natural language (September 2023); Google patents schema-based NL-to-API integration (September 2023). Academic research deepens schema routing approaches for massive databases (DBCopilot arxiv, December 2023). No major production deployments; adoption remains in mockup and experimentation phases.
  • 2024-Q1: Vendor product launches and continued academic research. Neurelo launches Cloud Data API Platform (January 2024) with AI-assisted natural language query generation. Academic research advances GraphQL query generation (IJCAI 2024) and reinforces enterprise limitations (CIDR 2024: NL2SQL "far from resolved"). Practitioner feedback highlights API design quality concerns in AI-generated code. Deployment moves into early production but limited to non-critical schema and query generation tasks.
  • 2024-Q2: Vendor consolidation continues with Neurelo maintaining GA platform status and expanding production use for REST and GraphQL API auto-generation from database models. General AI-assisted development tools (Amazon Q) gain enterprise traction with broad productivity claims, though API/schema generation remains a subset of broader capabilities. Adoption remains constrained by accuracy limitations on complex schemas and quality concerns in AI-generated API design. No breakthrough in enterprise-grade NL-to-schema accuracy; deployment still predominantly in lower-stakes schema prototyping and query generation.
  • 2024-Q3: Research advances in schema linking and text-to-SQL continue (E-SQL achieves 66.29% BIRD accuracy; RoSL improves recall by 25.1% for smaller 8B models). Community adoption of GraphQL remains active but schema-related challenges persist (45K StackOverflow analysis). Open-source NL-to-GraphQL tools emerge (talk-to-graphql). Critical assessments surface recurring reliability concerns: 52% error rate in AI-generated API code, security vulnerabilities, and hallucinations. Neurelo tutorials show iterative schema refinement in production tool. Overall trajectory: incremental improvements on specific benchmarks (BIRD) but no breakthrough in production adoption; accuracy remains constrained by schema complexity, and production deployment limited to non-critical schema/query generation tasks.
  • 2024-Q4: Focused research effort on GraphQL query generation (EMNLP 2024 industry track reports ~50% accuracy on new 10,940-pair dataset from IBM/StepZen; open-source NL2GQL dataset released October 2024). Academic interest in schema generation from requirements specifications continues (November 2024 publications). Neurelo expands operational workflows with custom API endpoint deployment via natural language queries integrated into git-based version control (December 2024). Critical reliability barriers persist: the accuracy gap between LLM-generated and human-authored code remains significant. Industry consensus emerges: custom fine-tuning and domain-specific training data are essential; zero-shot generation inadequate for production schemas. No breakthrough in enterprise adoption; market remains characterized by research intensification and vendor optimization of non-critical use cases (rapid prototyping, mockups, low-stakes query generation).
  • 2025-Q1: Research shifts toward direct schema generation from natural language (SchemaAgent multi-agent framework with 381-pair benchmark; Nixa addresses dynamic schema discovery in multi-tenant SaaS). Vendor ecosystem expands with AI App Builder entering GA schema generation market. Open-source tools mature (GQLPT+APIPT for GraphQL/REST). Developer confidence remains low despite high adoption: Q1 2025 surveys show 90% use but 3% high trust, 66% requiring substantial modifications, accuracy across tools ranges 31–65%. Critical assessment emphasizes technical debt accumulation and systemic reliability barriers. Production deployment unchanged: non-critical experimentation only, no enterprise-grade schema adoption for critical systems.
  • 2025-Q3: GraphQL specification update (September) optimizes for AI/LLM integration with OneOf input objects and Schema Coordinates. A user study (September) shows NL2SQL systems achieve 75% accuracy and 10–30% faster query completion vs. traditional SQL, though users remain frustrated by refinement cycles. Security vulnerabilities in production AI code assistants (Amazon Q Developer prompt injection/RCE, August) highlight ongoing risks. Ecosystem consolidation continues; no breakthrough in enterprise adoption. Production constraints unchanged: accuracy gaps, design quality below human baselines, and security risks preclude critical system deployment.
  • 2025-Q4: Research advances in schema-aware generation (GenLink multi-model learning achieving 67.34% BIRD accuracy, first systematic normalization-impact study). Oracle releases GA GraphQL schema generation from relational databases. Production case study demonstrates API code generation from natural language with zero-shot success. Vendor ecosystem matures with Oracle and existing platforms. However, critical practitioner analyses identify four blocking issues—schema awareness gaps, accuracy limitations, poor optimization, security risks—alongside production brittleness from schema churn. Enterprise adoption for critical systems remains negligible; deployment limited to non-critical prototyping and low-stakes query generation. Accuracy and production reliability remain below thresholds for enterprise-grade schema/API generation.
  • 2026-Jan: Breakthrough in NL-to-SQL accuracy: BAR-SQL achieves 91.48% on BIRD benchmark, surpassing Claude 4.5 and GPT-5, indicating narrowing of the gap. Production deployments mature: IBM deploys zero-config NLQ-to-SQL at enterprise scale (98.7% success across 17K tables, 3.1s latency). AWS Amazon Q Developer reaches GA with SmugMug case study (100% productivity gain). However, critical barriers persist: LLM planning accuracy collapses to 30-49% with 300+ API endpoints, improving only with semantic metadata and declarative APIs. DevPals demonstrates legacy API bridging in production (60% integration TCO reduction, 90% error reduction). Patent disclosures (IBM, others) focus on semantic data layers and agentic guardrails to prevent hallucination in enterprise NL-to-SQL. Accuracy ceiling in January 2026 remains: zero-shot generation inadequate for heterogeneous schemas; semantic metadata, domain-specific fine-tuning, and constraint-based generation required for production reliability. NL-to-API remains limited to non-critical query generation, rapid prototyping, and legacy system integration.
  • 2026-Feb: Vendor ecosystem expands with AWS Bedrock structured outputs (constrained decoding for schema compliance), Oracle NetSuite N/LLM embedding native schema generation in ERP, and Apollo GraphQL agent skills for automated schema design—but each vendor acknowledgement includes caveats about AI generation quality and reliability. Real-world incident documentation surfaces schema drift patterns and API brittleness (type shifts, silent field changes causing data corruption). Practitioner testing reveals stochastic LLM API failures across Anthropic, Google, and AWS for structured output tasks. Deployment barriers persist: schema evolution causes hidden coupling; zero-shot generation inadequate; LLM reliability not deterministic. Enterprise adoption for critical schemas unchanged; non-critical prototyping and legacy bridging remain primary use cases.
  • 2026-Mar: Product ecosystem accelerates with SharpAPI, Netlify Agent Runners, and expanded Neurelo Series A funding ($5M). Real-world deployments surface: QueryLytic at B2B SaaS (schema compression, validation, multi-database support), MANTA production instances (ChemoMaker pharmacy, Manufacturing BI). Enterprise adoption metrics mature: Bank of America Erica (19.5M+ users, 100M+ requests, 30% call center reduction), Microsoft Power BI, Tableau Ask Data (63% self-service analytics increase). Constrained decoding frameworks proliferate (Guidance, Outlines, XGrammar) but JSONSchemaBench benchmark (10K schemas) reveals significant feature coverage gaps across all frameworks. Critical assessment surfaces: practitioner analysis quantifies nested JSON schema failure rates (15-25% at 3+ nesting levels); controlled research finds zero end-task success even with formal JSON schemas, indicating semantic understanding remains the bottleneck, not schema syntactic compliance. Vendor landscape confirms: production adoption accelerating for non-critical query generation and legacy API bridging, but fundamental reliability barriers persist. Schema optimization (PARSE framework) emerges as research direction, treating schema design itself as a tuning problem rather than static interface contract.
  • 2026-Apr: Bench-to-production gap widens on multiple fronts. SQLStructEval and Omni Analytics (4,602 failed queries) confirm that 81.2% of production SQL errors are semantic rather than syntactic, and GPT-5 drops from 86% on Spider 1.0 to 29% on enterprise-scale BIRD-Interact, establishing that benchmark scores overstate real-world reliability by a wide margin. dbt Labs benchmark validates the semantic layer approach: text-to-SQL at 85-90% accuracy vs 97-100% with structured semantic layer, confirming the bottleneck is schema understanding, not LLM capability. AWS production deployment (Amazon Q with PostgreSQL schema generation in database migration) and normalized schema design research (16.8% QA accuracy gain from 3NF schemas) provide positive signals for constrained use cases, while structured output analysis identifies four failure layers, of which semantic validity and distribution shift remain outside constrained decoding's reach. Enterprise deployment evidence expanded: a Microsoft engineer documented production use of Copilot Chat for database schema generation from natural language in an enterprise context. Schema drift was documented as a critical production failure mode: a healthcare case study found 12 of 28 microservices with schema drift causing silent failures until automated validation was deployed. xAI shipped structured outputs GA alongside tool-calling failure analysis identifying schema mismatches and context limitations as primary root causes. Production deployment continues anchored to low-stakes use cases; enterprise-grade NL-to-schema for critical systems remains blocked by semantic reliability gaps and schema drift brittleness.
  • 2026-May: Governance patterns solidify and production-scale evidence emerges alongside a persistent semantic bottleneck. Uber's QueryGPT (1.2M queries/month) documents the production formula: 20+ iterations of intent classification, domain-specific workspace clustering, and context limiting, rather than better models, reduced query authoring from 10 to 3 minutes at scale. AutoBE GA ships complete backend generation (Prisma schema, OpenAPI specs, NestJS) from conversational requirements via 40+ specialized agents with 85-90% success rates and a 100% compilation guarantee, establishing production viability for non-critical backend scaffolding. Bytebase synthesis confirms deterministic governance (context limiting, structured evaluation, validation layers) as the success pattern across OpenAI, Google Cloud, Vercel, and Hex production deployments. DivSkill-SQL research achieves +11.1 pts on Snowflake and +8.3 on BigQuery with 3x fewer hallucinated schema references via agentic ensemble optimization. Structured Output Benchmark quantifies the core reliability challenge: LLMs produce syntactically valid JSON with semantically incorrect hallucinated values. Security analysis identifies 10 production risks (hallucinated schema, PII exposure, cost explosions) requiring deterministic validation pipelines. Semantic context (business rules, glossaries, descriptions) confirmed as the bottleneck across independent studies, with near-zero accuracy without metadata enrichment. Enterprise adoption for critical systems unchanged; deployment anchored to prototyping, legacy API bridging, and exploratory analytics with human-in-loop validation.
  • 2026-Jun: Vendor ecosystem and negative-signal research both accelerate. Microsoft released GitHub Copilot PostgreSQL extension with GA NL-to-DDL generation (@pgsql prompts generating table creation and schema modifications), confirming tier-1 IDE vendors treat schema generation as a production-ready feature. SANE research validates the schema-aware approach: LLMs reliably generate SQL schemas from natural language when given schema constraints and structured prompting, with no fine-tuning required, establishing guardrails as the differentiator, not model scale. FastAPI production templates document model-specific challenges: Claude Sonnet excels at async patterns while ChatGPT falls back to deprecated Pydantic v1 syntax 40% of the time, requiring explicit prompt engineering. Critical reliability research documents constraint decay in agentic code generation: 30+ point drop in assertion pass rates from baseline to fully constrained production task; ceiling effect observed where agent performance collapses rather than degrading gracefully. Industry benchmark of 34 LLMs on text-to-SQL reveals persistent 20%+ error rates on complex queries from incomplete parsing, hallucinated columns, and constraint mapping failures. Agentic technical debt framework formalizes operational cost structure: probabilistic systems incur a recurring stochastic tax independent of debt accumulation (tool contracts, routing logic, governance). Production deployment patterns unchanged: schema-aware approaches enable higher accuracy, governance layers prevent hallucination damage, but zero-shot generation remains inadequate for heterogeneous enterprise schemas. Enterprise adoption for critical systems remains constrained by semantic understanding bottleneck and operational complexity.
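The schema drift failure mode recorded in the 2026-Feb and 2026-Apr entries (type shifts and silent field changes) can be caught with a very small automated check. The following is a minimal sketch, assuming schemas are available as simple field-to-type mappings; the field names, the `diff_schema` and `is_breaking` helpers, and the rule for what counts as breaking are hypothetical and are not drawn from the healthcare case study.

```python
def diff_schema(expected, actual):
    """Compare two {field: type} mappings and list the differences."""
    removed = sorted(set(expected) - set(actual))
    added = sorted(set(actual) - set(expected))
    retyped = sorted(
        (name, expected[name], actual[name])
        for name in set(expected) & set(actual)
        if expected[name] != actual[name]
    )
    return {"removed": removed, "added": added, "retyped": retyped}


def is_breaking(diff):
    """Assumed policy: removed or retyped fields break consumers; additions do not."""
    return bool(diff["removed"] or diff["retyped"])


if __name__ == "__main__":
    # Snapshot the consumer was built against (hypothetical).
    snapshot = {"patient_id": "string", "dose_mg": "integer", "notes": "string"}
    # What the service returns today (hypothetical).
    live = {"patient_id": "string", "dose_mg": "string", "ward": "string"}

    report = diff_schema(snapshot, live)
    print(report)
    print("breaking:", is_breaking(report))
```

Run on a schedule or in CI against each service's published schema, a comparison like this turns a silent change into an explicit alert before downstream agents or generated code act on stale assumptions.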