The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.
A daily newsletter distilling the past two weeks of movement in a domain or two — delivered to your inbox while the index updates in the background.
Each dot marks the weighted maturity of practices within a domain — hover for a brief summary, click for more detail
AI generating API endpoints, database schemas, or data models from natural language descriptions of requirements. Includes REST/GraphQL API scaffolding and database schema design; distinct from infrastructure-as-code which targets deployment resources rather than application interfaces.
Generating API endpoints and database schemas from natural language remains an experimental capability despite rapid vendor investment. The promise is clear: describe what you need in plain language and get a working REST endpoint, GraphQL schema, or SQL query. Production reliability has not caught up. Benchmark accuracy on NL-to-SQL has climbed sharply (the BAR-SQL framework reaches 91.48% on BIRD, up from 40% two years ago), yet real-world deployment tells a different story: text-to-SQL accuracy collapses from 85-90% on abstract benchmarks to 29% on real enterprise-scale queries (BIRD-Interact), and production systems fail silently, emitting plausible-looking output that masks semantic errors. The reliability gap spans four distinct failure layers: syntax (largely solved), schema compliance (partially addressed by constrained decoding), semantic validity (unresolved), and distribution shift (unresolved). Developer surveys report only 3% high trust in AI-generated schema code despite 90% adoption, with two-thirds of outputs requiring substantial modification. Isolated enterprise successes exist, but they rest on heavy guardrails (semantic metadata layers, domain-specific fine-tuning, and constraint-based decoding) rather than zero-shot generation. The defining tension for this bleeding-edge practice is that the fundamental bottleneck is semantic understanding, not schema formatting. Until that gap closes, adoption stays confined to prototyping, legacy bridging, and low-stakes query generation.
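The schema-compliance guardrail can be sketched concretely: compile generated SQL against the live schema without executing it, so unknown tables and columns are caught before anything runs. A minimal sketch using SQLite's `EXPLAIN` (the schema and queries are invented for illustration); note it only addresses the schema-compliance layer, not semantic validity.

```python
import sqlite3

def check_schema_compliance(sql: str, schema_ddl: str) -> tuple[bool, str]:
    """Compile generated SQL against the target schema without executing it.

    Catches syntax errors and unknown tables/columns (the 'schema
    compliance' failure layer). A query that is valid but semantically
    wrong still passes -- that layer remains unresolved.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)   # load the real schema, no data
        conn.execute("EXPLAIN " + sql)   # plan-only: compiles, never runs
        return True, "ok"
    except sqlite3.Error as exc:
        return False, str(exc)
    finally:
        conn.close()

# Hypothetical schema and two model outputs
DDL = "CREATE TABLE orders (id INTEGER, total REAL, placed_at TEXT);"
ok, _ = check_schema_compliance("SELECT total FROM orders", DDL)      # compiles
bad, msg = check_schema_compliance("SELECT amount FROM orders", DDL)  # unknown column
```

This is the cheapest of the guardrails the evidence describes; the heavier ones (metadata layers, constrained decoding) sit in front of generation rather than after it.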
May 2026 evidence confirms that the benchmark-to-production gap remains the critical barrier. A Bytebase synthesis of four vendor case studies (OpenAI, Google Cloud, Vercel, Hex) documents a consistent pattern: enterprise text-to-SQL success depends not on model capability but on governance infrastructure (context limiting, rigorous evaluation, and deterministic validation layers). "Giving the system everything made it worse. Giving it less, but making that 'less' very clear, made it usable." The core problem persists: raw LLM output for schema and SQL generation requires intensive downstream validation before deployment. Getcollate's analysis of benchmark artifacts shows the gap starkly: GPT-4o's 85%+ accuracy on the curated Spider 1.0 collapses to 10.1% on the real-world Spider 2.0, demonstrating that semantic context (business rules, metric definitions, documentation), not LLM architecture, is the limiting factor. Structured Output Benchmark research (April 2026) pinpoints the foundational challenge: LLMs produce syntactically valid JSON schemas containing semantically incorrect, hallucinated values, a gap that constrained decoding alone cannot close.
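The "syntactically valid, semantically wrong" failure mode implies a deterministic validation step that checks generated values against ground-truth metadata rather than just schema shape. A minimal sketch, assuming a hypothetical metadata catalog; the model output and names are invented.

```python
import json

# Hypothetical metadata catalog: the ground truth a model can hallucinate against
CATALOG = {"orders": {"id", "total", "placed_at"}}

def validate_generated_mapping(raw: str) -> list[str]:
    """Parse model output as JSON, then check every referenced table and
    column against the catalog. JSON validity alone (what constrained
    decoding guarantees) would let hallucinated names straight through."""
    try:
        mapping = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    errors = []
    for table, cols in mapping.items():
        known = CATALOG.get(table)
        if known is None:
            errors.append(f"hallucinated table: {table}")
            continue
        errors.extend(f"hallucinated column: {table}.{c}"
                      for c in cols if c not in known)
    return errors

# Syntactically valid JSON that names a column that does not exist
errs = validate_generated_mapping('{"orders": ["total", "discount"]}')
```

The point is the layering: constrained decoding can guarantee the shape, but only a deterministic check against real metadata can reject the hallucinated value.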
Production deployments exist but remain confined to narrow use cases. Microsoft's ISE blog documents metadata enrichment as the critical infrastructure investment: metadata-starved systems fail at scale, while custom implementations reach roughly 75% accuracy once explicit column descriptions and domain context are supplied. Spotify's May 2026 natural language interface for ad campaign management demonstrates production NL-to-API with reported 85% user adoption, but it shipped as a convenience layer simplifying API calls rather than as autonomous generation. Multi-turn text-to-SQL (the Rose-SQL framework) achieves SOTA on the SParC and CoSQL benchmarks, and AutoLink achieves 97.4% schema recall at enterprise scale (3,000+ columns), signaling that structured reasoning and iterative exploration outperform zero-shot generation. Yet these advances remain contained within research and specialized use cases.
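Schema linking at AutoLink's scale is beyond a snippet, but its core retrieval step, narrowing thousands of columns to a few candidates before generation rather than dumping the full schema into context, can be sketched with simple lexical overlap. The column catalog and question below are invented for illustration.

```python
def link_schema(question: str, columns: dict[str, str], top_k: int = 3) -> list[str]:
    """Rank catalog columns by token overlap between the question and each
    column's name plus description; only the top candidates are passed to
    the model, limiting context instead of sending the whole schema."""
    q_tokens = set(question.lower().split())

    def score(name: str, desc: str) -> int:
        c_tokens = set(name.replace(".", " ").replace("_", " ").lower().split())
        c_tokens |= set(desc.lower().split())
        return len(q_tokens & c_tokens)

    ranked = sorted(columns.items(), key=lambda kv: score(*kv), reverse=True)
    return [name for name, desc in ranked[:top_k] if score(name, desc) > 0]

# Hypothetical slice of a wide enterprise catalog
COLUMNS = {
    "orders.total": "order total amount in USD",
    "orders.placed_at": "timestamp the order was placed",
    "users.email": "customer email address",
    "ads.spend": "daily ad spend",
}
cands = link_schema("what was total order spend last week", COLUMNS)
```

Production systems replace the lexical overlap with embeddings and iterative exploration, but the architecture is the same: retrieval first, generation second.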
Security and governance become operational requirements at scale. A dpriver analysis identifies ten production risks (hallucinated schemas, unauthorized access, PII exposure, cost explosions) that demand deterministic validation layers built as parse-check-approve-audit pipelines, not prompts. Schema drift emerges as the critical blocking issue: dependency updates silently change API response formats, causing hidden behavioral regressions in agents built against stale schemas. LLM structured-output reliability remains stochastic across providers: some calls succeed, others fail, with no deterministic recovery mechanism. Production deployment therefore stays anchored to non-critical use cases: rapid prototyping (where high refinement overhead is accepted), legacy API bridging (shallow, stable schemas), and exploratory analytics (where wrong queries are simply discarded).
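The schema-drift failure mode described above, an upstream API silently changing its response format under an agent, can be guarded deterministically by fingerprinting the response shape. A minimal sketch with invented payloads: hash keys and value types recursively, ignoring values, and compare against a pinned baseline on every call.

```python
import hashlib
import json

def shape_fingerprint(payload) -> str:
    """Hash the structural shape of a JSON payload (keys + value types,
    recursively), ignoring values, so an agent can detect that an upstream
    API changed its response format before acting on stale assumptions."""
    def shape(node):
        if isinstance(node, dict):
            return {k: shape(v) for k, v in sorted(node.items())}
        if isinstance(node, list):
            return [shape(node[0])] if node else []
        return type(node).__name__

    canon = json.dumps(shape(payload), sort_keys=True)
    return hashlib.sha256(canon.encode()).hexdigest()

# Pin the fingerprint when the agent is built...
baseline = shape_fingerprint({"id": 1, "total": 9.5, "items": [{"sku": "a"}]})
# ...then compare on every call: a renamed field drifts the fingerprint,
# while different values with the same shape do not.
drifted = shape_fingerprint({"id": 1, "amount": 9.5, "items": [{"sku": "b"}]})
same = shape_fingerprint({"id": 2, "total": 0.0, "items": [{"sku": "c"}]})
```

A mismatch should halt the agent and page a human; that is the "check" stage of the parse-check-approve-audit pipeline applied to inputs rather than outputs.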
— Synthesis of OpenAI, Google Cloud, Vercel, and Hex case studies documenting that enterprise text-to-SQL success requires governance layers (validation, access control, audit), not better prompts, establishing the maturity pattern.
— Microsoft vendor case study on LiveSQLBench: metadata enrichment (column descriptions, domain context) critical to accuracy; ~75% achieved with custom implementation, demonstrating infrastructure requirements for production.
— Spotify deployed production NL interface for advertisers to manage ad campaigns via plain language instead of manual API calls, reports 85% adoption; catalogs 30+ examples from tier-1 product teams showing production NL-to-API maturity.
— Training-free framework using small-scale LRMs for multi-turn text-to-SQL without expensive fine-tuning; SOTA performance on SParC/CoSQL, showing smaller models viable for production with structured reasoning.
— Identifies ten production security risks (hallucinated schemas, unsafe statements, unauthorized access, PII exposure) requiring a deterministic validation layer rather than prompts; these risks block wider production adoption.
— Multi-source benchmark evaluating LLM structured output quality across 7 metrics (JSON compliance, value accuracy, faithfulness); identifies critical gap between syntactic validity and semantic correctness in schema generation.
— Benchmarks reveal 85% accuracy on the curated Spider 1.0 versus 10.1% on the real-world Spider 2.0 for GPT-4o; semantic context gaps, not LLM capability, determine success, showing that benchmark artifacts mask production reliability.
— AAAI 2026 agent framework for enterprise-scale schema linking; 97.4% recall on Bird-Dev, 91.2% on Spider 2.0 while handling 3000+ column schemas, demonstrating production-viable approach to enterprise complexity.