The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.
A daily newsletter distilling the past two weeks of movement in a domain or two — delivered to your inbox while the index updates in the background.
AI generating API endpoints, database schemas, or data models from natural language descriptions of requirements. Includes REST/GraphQL API scaffolding and database schema design. Distinct from infrastructure-as-code, which targets deployment resources rather than application interfaces.
Generating API endpoints and database schemas from natural language remains an experimental capability despite rapid vendor investment and visible production deployments. The promise is clear: describe what you need and get a working REST endpoint, GraphQL schema, or SQL query. Benchmark accuracy on NL-to-SQL has climbed sharply, yet production reliability reveals a persistent gap between benchmark and real-world performance. Frontier models achieve 85-90% accuracy on curated benchmarks (Spider) but collapse to 10-29% on production-scale schemas (BEAVER, BIRD-Interact, enterprise deployments). The reliability gap stems from four distinct failure layers: syntax (mostly solved), schema compliance (partially addressed by constrained decoding), semantic validity (unresolved), and distribution shift (unresolved). Production deployments exist (Uber's QueryGPT processing 1.2M queries/month, AutoBE shipping 85-90% success rates, tier-1 integrations at AWS and Spotify), but they succeed only under heavy guardrails such as metadata enrichment, domain-specific fine-tuning, intent classification layers, and test-driven validation pipelines, not through zero-shot generation. Developer trust remains low: adoption stands at 90%, yet only 3% report high trust, and two-thirds of outputs require substantial modification. The defining tension for this bleeding-edge practice is that the fundamental bottleneck is not schema formatting capability but semantic understanding (business logic, metric definitions, implicit relationships) and operational maturity (validation, governance, schema drift management). Until semantic understanding and governance infrastructure close the gap, adoption stays confined to prototyping, legacy bridging, and low-stakes query generation.
Evidence from late May 2026 confirms, across independent assessments, that the benchmark-to-production gap is the persistent critical barrier. MIT/Intel/Harvard's BEAVER benchmark (May 2026) provides the starkest signal: frontier models achieve 10.8% accuracy on proprietary real-world enterprise schemas despite 82% on academic Spider. A failure rate of roughly 90% shows the gap is fundamental, not marginal. Real-world deployments document the production formula: Uber's QueryGPT (1.2M queries/month) succeeded through 20+ iterations of intent classification, domain-specific workspace clustering, and context limiting, not through better models. Amazon Science's SQL-Trail shows that multi-turn reinforcement learning with iterative feedback, rather than scale, enables 7B/14B models to outperform substantially larger systems by 5%. AutoBE, a production tool, ships 85-90% success rates on real examples via AST-based validation and a 100% compilation guarantee, with explicitly stated limitations on runtime behavior that require human review.
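To make "intent classification" and "context limiting" concrete, here is a minimal sketch in Python. It is not Uber's implementation: the workspace names, keywords, and tables are hypothetical, and naive keyword overlap stands in for whatever classifier a real system would use. The point it illustrates is that the question is routed to one domain first, and only that domain's tables would be offered to the model.

```python
import re
from typing import List, Optional

# Hypothetical workspaces, invented for illustration. Each one groups the
# tables belonging to a single business domain.
WORKSPACES = {
    "mobility": {
        "keywords": {"trip", "trips", "driver", "drivers", "rider", "fare"},
        "tables": ["trips", "drivers"],
    },
    "ads": {
        "keywords": {"campaign", "campaigns", "impression", "impressions", "click", "clicks"},
        "tables": ["campaigns", "impressions"],
    },
}


def classify_intent(question: str) -> Optional[str]:
    """Pick the workspace whose keywords overlap most with the question."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    scores = {name: len(words & ws["keywords"]) for name, ws in WORKSPACES.items()}
    best = max(scores, key=scores.get)
    # No overlap at all means the question is out of scope.
    return best if scores[best] > 0 else None


def limited_context(question: str, max_tables: int = 2) -> List[str]:
    """Return only the tables the model should see for this question."""
    workspace = classify_intent(question)
    if workspace is None:
        return []
    return WORKSPACES[workspace]["tables"][:max_tables]
```

For a question such as "How many trips did each driver complete last week?", `limited_context` returns only the mobility tables, so a prompt built from it would never mention the ads schema. An out-of-scope question returns an empty list, which a caller could treat as a signal to refuse rather than guess.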
Vendor ecosystem maturity is evident across tier-1 platforms. AWS Amazon Q Developer reached GA with production deployments at SmugMug and TCS. Spotify's natural language interface for ad management (May 2026) reports 85% adoption but is explicitly deployed as a convenience layer, not as autonomous generation. Schema awareness now separates viable tools from failures: Analytics Insight's 2026 benchmarking shows schema-connected tools achieving 64-90% accuracy, while schema-agnostic approaches invent columns and fail entirely.
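The invented-column failure can be caught deterministically before a query reaches real data. The sketch below is one possible approach, assuming a SQLite-compatible dialect and a hypothetical two-table schema: it compiles the generated SQL with `EXPLAIN` against an empty in-memory copy of the schema, so unknown tables, unknown columns, and syntax errors surface as exceptions without executing anything.

```python
import sqlite3
from typing import Tuple

# Hypothetical schema, invented for illustration only.
SCHEMA_DDL = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
"""


def check_against_schema(sql: str, schema_ddl: str = SCHEMA_DDL) -> Tuple[bool, str]:
    """Compile `sql` against an empty in-memory copy of the schema.

    EXPLAIN makes SQLite parse and plan the statement without running it,
    so hallucinated tables or columns are reported as errors.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)
        conn.execute(f"EXPLAIN {sql}")
        return True, "ok"
    except (sqlite3.Error, sqlite3.Warning) as exc:
        # sqlite3.Warning covers multi-statement input on older Python versions.
        return False, str(exc)
    finally:
        conn.close()
```

A query such as `SELECT email FROM customers` is rejected with a "no such column" message, while `SELECT name FROM customers WHERE region = 'EU'` passes. This check covers only the syntax and schema-compliance layers. A query that compiles can still answer the wrong question.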
The semantic bottleneck and governance requirements dominate production constraints. Bytebase's synthesis of OpenAI, Google Cloud, Vercel, and Hex case studies establishes governance as the success pattern: context limiting, rigorous evaluation, and deterministic validation. dpriver identifies 10 production risks (including hallucinated schema, unauthorized access, and PII exposure) that require deterministic validation pipelines, not prompts. Schema drift is emerging as a critical blocking issue: silent field changes cause hidden behavioral regression in agents trained on stale schemas (healthcare case study: 12 of 28 microservices). Governance tooling is maturing: constrained decoding frameworks (Guidance, Outlines, XGrammar) are proliferating, but JSONSchemaBench reveals significant feature coverage gaps across all of them.
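Schema drift of this kind can be detected with a plain snapshot comparison. The sketch below assumes schemas are available as nested dictionaries mapping table to column to type. That representation and the `patients` example are assumptions made for illustration, not something taken from the case study.

```python
from typing import Dict, List

Schema = Dict[str, Dict[str, str]]  # table -> column -> declared type


def diff_schema(snapshot: Schema, live: Schema) -> List[str]:
    """Report tables and columns added, removed, or retyped since the snapshot."""
    findings = []
    for table in sorted(set(snapshot) | set(live)):
        old_cols = snapshot.get(table)
        new_cols = live.get(table)
        if old_cols is None:
            findings.append(f"table added: {table}")
            continue
        if new_cols is None:
            findings.append(f"table removed: {table}")
            continue
        for col in sorted(set(old_cols) | set(new_cols)):
            if col not in new_cols:
                findings.append(f"column removed: {table}.{col}")
            elif col not in old_cols:
                findings.append(f"column added: {table}.{col}")
            elif old_cols[col] != new_cols[col]:
                findings.append(
                    f"type changed: {table}.{col} {old_cols[col]} -> {new_cols[col]}"
                )
    return findings


# Hypothetical example: a field was silently renamed between deployments.
snapshot = {"patients": {"id": "INTEGER", "dob": "TEXT"}}
live = {"patients": {"id": "INTEGER", "date_of_birth": "TEXT"}}
drift = diff_schema(snapshot, live)
```

Here `drift` lists `patients.date_of_birth` as added and `patients.dob` as removed. Running such a diff before each agent session, and failing loudly when it is non-empty, turns a silent regression into an explicit one.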
Production deployment remains anchored to non-critical use cases: rapid prototyping, legacy API bridging, and exploratory analytics. Practitioners document the reality: 78% zero-shot accuracy means more than 1 in 5 queries is wrong, often silently. Evaluation practice is maturing: benchmarks report 90%+, but real-world Execution Accuracy drops to 51%, with Snowflake Cortex the exception at >90%, achieved through semantic layers and curated context. Domain-specific, low-resource settings (the production norm) lack annotated training data, creating a chicken-and-egg problem that knowledge distillation frameworks are beginning to address.
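Execution Accuracy is commonly computed by running the generated query and a reference query against the same database and comparing the returned rows. The sketch below is a minimal version of that idea under stated assumptions: it uses SQLite, a hypothetical `orders` table, and an order-insensitive comparison, which would be too lenient for queries where `ORDER BY` matters.

```python
import sqlite3
from collections import Counter
from typing import List, Tuple


def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """True when both queries return the same rows, ignoring row order."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to run counts as wrong
    gold = conn.execute(gold_sql).fetchall()
    return Counter(predicted) == Counter(gold)


def execution_accuracy(conn: sqlite3.Connection, pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (predicted, gold) pairs whose results match."""
    if not pairs:
        return 0.0
    return sum(execution_match(conn, p, g) for p, g in pairs) / len(pairs)


# Hypothetical data and queries, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "EU", 10.0), (2, "EU", 20.0), (3, "US", 5.0)],
)
gold = "SELECT SUM(total) FROM orders WHERE region = 'EU'"
pairs = [
    ("SELECT SUM(total) FROM orders WHERE region IN ('EU')", gold),  # correct
    ("SELECT SUM(total) FROM orders", gold),                         # runs, wrong answer
    ("SELECT SUM(amount) FROM orders WHERE region = 'EU'", gold),    # invented column
]
accuracy = execution_accuracy(conn, pairs)
```

The second pair is the silent failure mode: the query runs without error and returns a plausible number, so only a comparison against a trusted reference exposes it. In this toy set one of three predictions matches, giving an accuracy of one third.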
— Peer-reviewed research: LLMs reliably generate SQL schemas from NL when given schema constraints and structured prompting, no model training required. Confirms schema awareness and guardrails enable production-viable generation.
— Industry benchmark of 34 LLMs on text-to-SQL with error analysis: 20%+ error rates on complex queries; failures stem from incomplete request parsing, hallucinated columns, and constraint mapping failures. Validation essential for production.
— Microsoft GA product feature: Copilot generates PostgreSQL schema modifications and SQL from NL prompts (e.g., 'convert the hr.employees table to use a JSONB column'). Demonstrates production NL-to-schema generation in tier-1 IDE.
— Production guide for AI-generated FastAPI code and Pydantic schemas from NL specifications. Model analysis: Claude Sonnet excels at async patterns; ChatGPT generates deprecated Pydantic v1 syntax. Explicit prompt guidance prevents AI fallback to v1.
— CRITICAL SIGNAL: Agentic code generation loses 30+ points in assertion pass rates as structural constraints accumulate. LLM agents pass unit tests but violate runtime ORM contracts; constraint ceiling exists, not graceful degradation.
— Formal framework distinguishing technical debt from recurring stochastic tax in probabilistic agentic workflows. Quantifies operational cost structure for API/schema generation systems, identifying tool/schema debt and governance debt vectors.
— Uber's QueryGPT case study: 1.2M queries/month processed, reduced query authoring from 10 minutes to 3 minutes through intent agents and domain-specific workspace clustering, demonstrating production evolution and schema scaling challenges.
— Production tool generating complete backends (Prisma schema, OpenAPI specs, NestJS) from conversational requirements; 40+ specialized agents, 100% compilation guarantee, 85-90% success rates on real-world examples.