The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.
AI generation of natural-sounding speech from text for audiobooks, accessibility, navigation, and content delivery. Includes multi-language synthesis and emotional expression; distinct from voice cloning, which replicates specific voices rather than generating generic natural speech.
Neural text-to-speech has crossed the quality threshold: models now match human speech on perceptual benchmarks, and enterprise adoption is accelerating with quantified business outcomes. May 2026 deployment evidence shows an 8% conversion uplift (Mahindra auto sales), 80%+ cost reductions in audiobook production (Pocket FM), production time cut from 4-7 months to hours (Spoonlabs audio novels), and measurable accessibility gains (a meta-analysis reports d=0.35 on reading comprehension). The market is fragmenting into three tiers. Premium vendors (ElevenLabs at $330M ARR in 2025 and 41% Fortune 500 penetration; Microsoft Azure with 400+ voices across 140+ languages) compete on reliability and ecosystem breadth. Efficiency-first startups (Inworld at $10/M chars, Gradium at 155ms latency) disrupt on cost and speed while matching quality. Open-source alternatives (Kokoro, Mistral Voxtral, Qwen3) achieve sub-$1 production cost with near-parity naturalness. Latency compression continues: May 2026 benchmarks show sub-100ms is achievable (Gradium, Cartesia), but streaming latency still degrades at scale (800ms at 100 concurrent streams); WER variance across vendors has narrowed to 1-3%. Reliability remains inconsistent (recurring ElevenLabs API incidents and Azure multilingual failures), and vendor lock-in costs (2.3-5.7x) constrain adoption beyond specialist use cases. Evaluation methodology gaps are also emerging: benchmarks are fragmented, metrics are saturated, and there is no standardization. The practice sits firmly at leading-edge. Quality is validated by peer review, prices have compressed (though a 200x pricing variance persists despite commoditized MOS), and deployment viability is proven in high-volume and accessibility workflows, but inconsistent reliability, missing evaluation standards, and uneven cost economics still prevent mainstream real-time interactive adoption.
The vendor landscape has fragmented. Premium platforms: Amazon Polly (31 generative voices, 20 locales), Azure Neural TTS (400+ voices, 140+ languages), and ElevenLabs ($330M ARR in 2025, $11B valuation, 41% Fortune 500 penetration, 2M+ agents deployed). Efficiency challengers: Inworld TTS-1 Max ($10/M chars, ranked first at ELO 1,162), PlayHT (85.6% approval), MiniMax (86.2% approval; Speech-02-Turbo at ELO 1,107), and WellSaid Labs (82% approval). An independent 10,000-listener benchmark shows AI-native startups outperforming Big Tech on quality; overall approval stands at 67%, and a 34% AI detection rate signals a remaining authenticity gap. Open-source options are maturing: Mistral Voxtral Mini (in-browser, <500ms latency, Apache 2.0), Qwen3-TTS (10 languages, 97ms latency), and Kokoro 82M ($0.70/M chars, ELO 1,059, <50ms). Market projections run from $5.1B (2025) to $43.8B (2034), a 28.6% CAGR, with voice AI applications spanning customer service, audiobooks, media, e-learning, and gaming.
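As a rough sanity check on the market projection, the growth rate implied by the two endpoint figures can be computed directly. The sketch below is illustrative only: the function name is hypothetical, and it assumes nine annual compounding periods between 2025 and 2034. Under that assumption the implied rate comes out near 27%, somewhat below the quoted 28.6%; the gap may simply reflect a different base year or rounding in the underlying forecast, which the source does not specify.

```python
def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate implied by two endpoint values."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Endpoint figures quoted in the text (USD billions).
start_usd_bn = 5.1    # 2025
end_usd_bn = 43.8     # 2034

# Assumption: one compounding period per calendar year, 2025 -> 2034.
years = 2034 - 2025

cagr = implied_cagr(start_usd_bn, end_usd_bn, years)
print(f"Implied CAGR over {years} years: {cagr:.1%}")
```

Changing `years` to 8 or 10 shows how sensitive the headline rate is to the choice of base year.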
Deployment evidence spans content, accessibility, healthcare, finance, and customer service, with quantified outcomes. Specific cases from May 2026: Mahindra & Mahindra reports an 8% conversion uplift on its XUV 7XO launch using ElevenLabs voice agents; Spoonlabs (South Korea) reduced audio novel production from 4-7 months with human voice actors to hours using ElevenLabs, enabling simultaneous multilingual production (Korea, Japan, Taiwan). Healthcare and accessibility evidence: a meta-analysis shows a TTS effect on reading comprehension (d=0.35), with an addressable population of 7.5M special education students under the US IDEA. Audiobook market: 80%+ cost reduction via ElevenLabs at 30,000+ hours (Pocket FM), with independent creators publishing 1-2 titles monthly at $99/month vs. $2-5k for human narration. Technical maturity: May 2026 latency benchmarks show Gradium TTS at 155ms TTFA (P50) with a 2ms IQR and 3.3% WER, the lowest latency with no quality tradeoff; the competitive spread puts ElevenLabs at 264ms and Deepgram at 313ms, while OpenAI (2,295ms) is unsuitable for real-time use. However, adoption barriers persist: streaming latency degrades to 800ms at 100 concurrent streams; reliability incidents recur (ElevenLabs API failures, Azure multilingual errors); 94% of enterprises report concern about vendor lock-in (2.3x-5.7x switching costs); only 29% are willing to pay an AI premium; and evaluation methodology is fragmented, with no standardization (MOS saturation, benchmark proliferation). The result is a market in which quality is commoditized, pricing shows a 200x variance across vendors (OpenAI at $15/M to ElevenLabs at $206/M), reliability remains inconsistent at scale, and open-source options ($0.70/M) threaten vendor margins.
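The pricing spread depends heavily on which endpoints are compared, so it can help to work the ratios out from the figures quoted above. The sketch below is a minimal illustration, not a pricing model: the dictionary and helper names are hypothetical, and it assumes all four prices are expressed in the same unit (USD per million characters), which the source states explicitly only for Inworld and Kokoro.

```python
# Prices quoted in the text; assumed to share one unit (USD per million characters).
prices_per_million = {
    "Kokoro 82M (open source)": 0.70,
    "Inworld TTS-1 Max": 10.0,
    "OpenAI": 15.0,
    "ElevenLabs": 206.0,
}


def price_ratio(expensive: str, cheap: str) -> float:
    """How many times more expensive one option is than another."""
    return prices_per_million[expensive] / prices_per_million[cheap]


# Spread between the two commercial vendors named in the text.
commercial_spread = price_ratio("ElevenLabs", "OpenAI")

# Spread between the most expensive vendor and the open-source option.
overall_spread = price_ratio("ElevenLabs", "Kokoro 82M (open source)")

print(f"ElevenLabs vs OpenAI: {commercial_spread:.1f}x")
print(f"ElevenLabs vs Kokoro: {overall_spread:.0f}x")
```

On these numbers the commercial-only spread is roughly an order of magnitude, while the spread against open source is in the hundreds, so a headline multiple is best read as indicating scale rather than a precise ratio.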
— Comprehensive model aggregation platform comparing multiple production TTS systems with benchmarked latency, quality, and language support metrics across competing vendors.
— Peer-reviewed evidence: meta-analysis showing TTS effect on reading comprehension (d=0.35), TAM of 7.5M special education students in US under IDEA, validating accessibility and learning outcomes.
— Named organization (Mahindra & Mahindra, Indian auto manufacturer) deployed ElevenLabs voice agents for outbound sales during XUV 7XO launch with documented 8% conversion uplift.
— Expert evaluation methodology critique from Inworld AI Head of Evaluations, identifying benchmark saturation, metric fragmentation, and standardization gaps limiting TTS assessment reliability.
— Named organization (Spoonlabs, South Korean audio platform) deployed ElevenLabs for audio novel production, reducing production time from 4–7 months (voice actors) to a few hours.
— Independent Coval + Gradium benchmark of 9 TTS models on Time-to-First-Audio latency; Gradium TTS achieves 155ms P50 with 2ms IQR, the lowest latency without measurable quality tradeoffs.
— Dual-source TTS pronunciation accuracy benchmark across 9 models; Gradium TTS achieves 3.3% WER (Coval) and 1.11% WER (MiniMax), demonstrating no quality/speed tradeoff at scale.
— Comprehensive guide to TTS evaluation; traces architecture evolution through 4 generations (concatenative→HMM→neural→diffusion), defines quality metrics (MOS 4.0+), and selection framework for production.