The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.
AI generation of natural-sounding speech from text for audiobooks, accessibility, navigation, and content delivery. Includes multi-language synthesis and emotional expression. Distinct from voice cloning, which replicates specific voices rather than generating generic natural speech.
The vendor landscape has stratified by performance envelope and use case. Sub-300ms P95 latency tier (real-time conversational): Cartesia at 40–90ms (SSM architecture, Sonic-Turbo), OpenAI Realtime-2 (May 7 launch with speech-to-speech), LiveKit+Gemini (295ms P95), Gradium Sonic (155ms P50), Inworld TTS-2 (<130ms P90 on Mini tier). These figures mix P50, P90, and P95 measurements, so they are not directly comparable. Batch/content tier: ElevenLabs (41% of the Fortune 500, 2M agents, $500M+ ARR post-Series D, $11B valuation), Amazon Polly (31 generative voices, AWS SageMaker JumpStart integration June 2026), Azure (400+ voices, 140+ languages, MAI-Voice-2 zero-shot cloning June 2). Efficiency tier: Inworld TTS-1 Max ($10/M chars, ELO 1,162), Minimax Speech 2.8 HD (86.2% approval, ELO 1,107), PlayHT (85.6%), WellSaid Labs (82%). Open-source tier: Higgs Audio v3 (Boson AI, 4B parameters, 100+ languages at 3.61% WER, inline emotion/style control), Chatterbox-Turbo (65.3% preference over ElevenLabs Turbo v2.5 in blind tests), Kokoro ($0.70/M chars), Fish Speech, CosyVoice 2.0. Independent evaluation platforms (Coval, LMSYS, Artificial Analysis, Voice AI Leaderboard) standardize benchmarking; Coval emphasizes that vendor benchmarks are unreliable on P95 latency and domain-specific pronunciation. Vendor-published P50 figures consistently look better than the P95/P99 latencies observed in production under load.
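To make the gap between a published P50 and a production P95/P99 concrete, here is a minimal Python sketch that computes nearest-rank percentiles over a list of latency samples. The sample data is synthetic and purely illustrative (an assumption, not a measurement of any vendor listed above), and the function names are hypothetical.

```python
import math
import random


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of
    all samples at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Return the P50, P95, and P99 of a list of latencies in milliseconds."""
    return {f"p{q}": percentile(latencies_ms, q) for q in (50, 95, 99)}


# Synthetic data (assumption): 90% of requests take the fast path,
# 10% hit a slow path, as might happen under load.
rng = random.Random(0)
latencies_ms = [rng.gauss(90, 10) for _ in range(900)]
latencies_ms += [rng.gauss(600, 100) for _ in range(100)]

summary = summarize(latencies_ms)
print({name: round(value) for name, value in summary.items()})
```

With a distribution shaped like this, the median sits in the fast cluster while P95 and P99 land in the slow tail, which is why a headline P50 says little about the experience of the slowest requests.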
Production deployment spans content (5.2x monthly growth in independent audiobook production via Inkfluence, ACX policy enabling AI narration June 2026), accessibility (7.5M K-12 students under US IDEA), healthcare (30M clinician minutes via voice agents), customer service (Klarna 10x resolution, Mahindra 8% conversion, Revolut 31 languages, Berlin-Brandenburg Airport zero-wait), and emerging enterprise deployments (Cisco Webex, NVIDIA, TELUS Digital). Latency benchmarks show that a streaming TTS first chunk in 60–100ms is achievable, but holding P95 end-to-end latency under 500ms requires an optimized architecture and co-located inference; degradation at 100+ concurrent streams (800ms P95) remains unresolved. Emotional expressiveness and zero-shot cloning are now standard features across major vendors (Microsoft MAI-Voice-2, ElevenLabs v3, Google Gemini 3.1 Flash). Architectural innovation continues (flow-matching, diffusion, and SSM-based designs achieving sub-100ms synthesis). However, adoption barriers are intensifying: consumer preference (Audio Publishers: 16% have tried AI audiobooks, AI revenue is 0.03% of the market, willingness dropped 70%→61% YoY), quality ceilings (Respeecher: TTS is unsuitable for performance-driven content because prosody, emotion, code-switching, and multispeaker coherence remain unsolved), regulatory risk (BIPA litigation documenting consent violations, precedent damages of $100M–$2B+), operational instability (recurring platform incidents, regional failures), regional constraints (emerging markets face 800ms–1.4s latency penalties, language quality gaps, compliance misalignment), and vendor lock-in (94% enterprise concern, 2.3–5.7x switching costs). Evaluation methodology is now standardized across platforms; with quality commoditized at the table-stakes tier (MOS 4.2–4.3), competitive differentiation has shifted to latency consistency, pronunciation accuracy (1–3% WER variance), and production reliability.
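The distinction above between first-chunk latency and end-to-end latency can be measured with a small timing harness. The sketch below is a hedged illustration in Python: `fake_tts_stream` is a hypothetical stand-in for a streaming TTS client (it only sleeps and yields silent bytes), and its delays are assumptions chosen for the demo, not figures from any vendor.

```python
import time
from typing import Iterable, Iterator


def fake_tts_stream(n_chunks: int = 5,
                    first_delay_s: float = 0.05,
                    gap_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS client.

    Waits first_delay_s before the first chunk, then gap_s between
    later chunks. Each chunk is 320 bytes of silence.
    """
    time.sleep(first_delay_s)
    yield b"\x00" * 320
    for _ in range(n_chunks - 1):
        time.sleep(gap_s)
        yield b"\x00" * 320


def measure_stream(chunks: Iterable[bytes]) -> dict:
    """Consume a chunk stream and record first-chunk and total latency."""
    start = time.perf_counter()
    first_chunk_ms = None
    total_bytes = 0
    n_chunks = 0
    for chunk in chunks:
        if first_chunk_ms is None:
            # Time from request start until the first audio bytes arrive.
            first_chunk_ms = (time.perf_counter() - start) * 1000
        total_bytes += len(chunk)
        n_chunks += 1
    total_ms = (time.perf_counter() - start) * 1000
    return {
        "first_chunk_ms": first_chunk_ms,
        "total_ms": total_ms,
        "chunks": n_chunks,
        "bytes": total_bytes,
    }


result = measure_stream(fake_tts_stream())
print(result)
```

Running this once gives a single sample. Collecting many samples under realistic concurrency, then taking the P95 of `total_ms`, is what the end-to-end figures above refer to.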
— UK government MoU with ElevenLabs commits to AI voice across public services at national scale; deployment requires multi-model orchestration across 300+ languages with validation routing; signals government-grade production requirements and vendor credibility for public-sector scale.
— Interspeech 2026 advances TTS naturalness through non-verbal vocalization support (laughter, sighs) with speaker identity preservation; 22.66% speech-NVV EER vs 38.93% baseline; moves beyond phonetic speech into expressive sound generation.
— Interspeech 2026 peer-reviewed evaluation of 17 TTS systems across 193 speakers for speech disorder voice reconstruction; identifies evaluation methodology gaps (MOS limited sensitivity) and demonstrates accessibility deployment maturity.
— ACX maintains gatekeeping on independent author AI submissions; only narrator voice replicas (opt-in per-title) permitted; no published timeline for third-party TTS acceptance; documents regulatory and platform friction limiting audiobook TTS adoption velocity.
— Critical adoption barrier signal: AI-narrated audiobooks account for 0.03% of $2.43B market; consumer willingness to try AI voices declined 70%→61% YoY despite technology availability, indicating quality/preference barriers dominate over technical capability.
— NVIDIA containerizes TTS as production microservice (NIM) with published benchmarks: 55–70ms first-chunk latency on L40/H100, sub-100ms inter-chunk on A100; signals infrastructure-layer commoditization of TTS across cloud GPU platforms.
— Comprehensive ecosystem analysis documenting shift from naturalness to expressive control/realtime/local privacy; covers 8+ vendor releases (Microsoft MAI-Voice-2, Google Gemini 3.1, AWS SageMaker, Inworld, Soniox), open-source models, and community pain points (hallucination, dropped words, unnatural turn-taking).
— ElevenLabs enterprise product GA with two-platform architecture (Creative for media, Agents for conversational AI), SOC2/GDPR/HIPAA compliance, 30+ integrations, customer case studies (Klarna 10X resolution, Revolut 31 languages) demonstrating Fortune 500 deployment scale.
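As a quick sanity check on the audiobook item above, the following Python sketch works through the arithmetic implied by its figures: the dollar value of a 0.03% share of a $2.43B market, and the size of the 70%→61% drop in consumer willingness. It uses only the numbers quoted in that item; the variable names are illustrative.

```python
# Figures quoted in the audiobook adoption item.
market_usd = 2_430_000_000        # $2.43B total audiobook market
ai_share_per_10k = 3              # 0.03% expressed as 3 per 10,000

# Integer arithmetic avoids floating-point rounding on the dollar value.
ai_revenue_usd = market_usd * ai_share_per_10k // 10_000

willing_before_pct = 70
willing_after_pct = 61
point_drop = willing_before_pct - willing_after_pct   # percentage points
relative_drop = point_drop / willing_before_pct       # share of prior level

print(f"AI-narrated revenue: ${ai_revenue_usd:,}")
print(f"Willingness drop: {point_drop} points ({relative_drop:.1%} relative)")
```

In other words, the AI-narrated share amounts to roughly $0.73M of a $2.43B market, and the nine-point fall in willingness is a relative decline of about 13%.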