Perly Consulting │ Beck Eco

The State of Play

A living index of AI adoption across industries — where established practice meets the bleeding edge
UPDATED DAILY

The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.


AI Maturity by Domain

[Chart: each dot marks the weighted maturity of practices within a domain, plotted per domain on an axis running from Bleeding Edge to Established.]

Text-to-speech — natural voice synthesis

Tier: Leading Edge

Trajectory: Advancing

AI generation of natural-sounding speech from text for audiobooks, accessibility, navigation, and content delivery. Includes multi-language synthesis and emotional expression; distinct from voice cloning which replicates specific voices rather than generating generic natural speech.

CURRENT LANDSCAPE

The vendor landscape has stratified by performance envelope and use case:

  • Real-time conversational tier (sub-300ms P95 latency): Cartesia at 40–90ms (SSM architecture, Sonic-Turbo), OpenAI Realtime-2 (launched May 7 with speech-to-speech), LiveKit+Gemini (295ms P95), Gradium Sonic (155ms P50), Inworld TTS-2 (<130ms P90 on the Mini tier).
  • Batch/content tier: ElevenLabs (41% Fortune 500 penetration, 2M agents, $500M+ ARR post-Series D, $11B valuation), Amazon Polly (31 generative voices, AWS SageMaker JumpStart integration June 2026), Azure (400+ voices, 140+ languages, MAI-Voice-2 zero-shot cloning June 2).
  • Efficiency tier: Inworld TTS-1 Max ($10/M chars, ELO 1,162), Minimax Speech 2.8 HD (86.2% approval, ELO 1,107), PlayHT (85.6%), WellSaid Labs (82%).
  • Open-source tier: Higgs Audio v3 (Boson AI, 4B parameters, 100+ languages at 3.61% WER, inline emotion/style control), Chatterbox-Turbo (65.3% preference over ElevenLabs Turbo v2.5 in blind tests), Kokoro ($0.70/M chars), Fish Speech, CosyVoice 2.0.

Independent evaluation platforms (Coval, LMSYS, Artificial Analysis, Voice AI Leaderboard) standardize benchmarking. Coval emphasizes that vendor benchmarks are unreliable on P95 latency and domain-specific pronunciation: vendor-published P50 figures look systematically better than the P95/P99 latency seen in production under load.
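The gap between a published P50 and a production P95 is easy to see with a small worked example. The sketch below is illustrative only: the latency samples are synthetic (hypothetical values, not measurements from any vendor named above), and the nearest-rank percentile is just one of several common percentile definitions.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Synthetic time-to-first-audio samples in milliseconds (hypothetical):
# 90% of requests are fast, 10% are slowed by queueing under load.
fast_ms = [140 + (i % 21) for i in range(900)]   # 140..160 ms
slow_ms = [500 + 3 * i for i in range(100)]      # 500..797 ms
latencies_ms = fast_ms + slow_ms

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
print(f"P50={p50}ms  P95={p95}ms  P99={p99}ms")
```

In this synthetic workload the median sits near 150ms while the 95th percentile is several times higher, which is why a P50 figure alone says little about whether a sub-300ms P95 target is being met.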

Production deployment spans content (5.2x monthly growth in independent audiobook production via Inkfluence; ACX's June 2026 policy enabling AI narration through opt-in narrator voice replicas), accessibility (7.5M K-12 students under US IDEA), healthcare (30M clinician minutes via voice agents), customer service (Klarna 10x resolution, Mahindra 8% conversion uplift, Revolut 31 languages, Berlin-Brandenburg Airport zero-wait), and emerging enterprise deployments (Cisco Webex, NVIDIA, TELUS Digital).

Latency benchmarks show that a streaming TTS first chunk in 60–100ms is achievable, but P95 end-to-end latency under 500ms requires optimized architecture and co-located inference; degradation at 100+ concurrent streams (800ms P95) remains unresolved. Emotional expressiveness and zero-shot cloning are now standard features across major vendors (Microsoft MAI-Voice-2, ElevenLabs v3, Google Gemini 3.1 Flash). Architectural innovation is ongoing (flow-matching, diffusion, and SSM-based designs achieving sub-100ms synthesis).

However, adoption barriers are intensifying:

  • Consumer preference: Audio Publishers data show 16% have tried AI audiobooks, AI revenue is 0.03% of the market, and willingness to try fell from 70% to 61% YoY.
  • Quality ceilings: Respeecher considers TTS unsuitable for performance-driven content because prosody, emotion, code-switching, and multispeaker coherence remain unsolved.
  • Regulatory risk: BIPA litigation documents consent violations, with precedent damages of $100M–$2B+.
  • Operational instability: recurring platform incidents and regional failures.
  • Regional constraints: emerging markets face 800ms–1.4s latency penalties, language quality gaps, and compliance misalignment.
  • Vendor lock-in: 94% of enterprises report concern, with switching-cost multiples of 2.3–5.7x.

Evaluation methodology is now standardized across platforms; quality commoditization at the table-stakes tier (MOS 4.2–4.3) has shifted competitive differentiation to latency consistency, pronunciation accuracy (1–3% WER variance), and production reliability.
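For readers unfamiliar with the WER figures quoted here, word error rate is the word-level edit distance between a reference text and a transcript, divided by the number of reference words. For TTS it is commonly measured by transcribing the synthesized audio with a speech recognizer and comparing the transcript to the input text; whether the figures cited above were produced that way is an assumption. A minimal sketch, with a hypothetical function name and made-up sentences:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the transcript
            insertion = curr[j - 1] + 1  # extra word in the transcript
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# Hypothetical example: the input text versus a transcript of the synthesized audio.
input_text = "please transfer forty dollars to my savings account"
transcript = "please transfer fourteen dollars to savings account"
print(f"WER = {word_error_rate(input_text, transcript):.1%}")
```

On this scale, a 1–3% difference between vendors corresponds to one to three additional word errors per hundred words of input.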

TIER HISTORY

Research: Jan-2018 → Jan-2018
Bleeding Edge: Jan-2018 → Jan-2019
Leading Edge: Jan-2019 → present

EVIDENCE (163)

— UK government MoU with ElevenLabs commits to AI voice across public services at national scale; deployment requires multi-model orchestration across 300+ languages with validation routing; signals government-grade production requirements and vendor credibility for public-sector scale.

— Interspeech 2026 advances TTS naturalness through non-verbal vocalization support (laughter, sighs) with speaker identity preservation; 22.66% speech-NVV EER vs 38.93% baseline; moves beyond phonetic speech into expressive sound generation.

— Interspeech 2026 peer-reviewed evaluation of 17 TTS systems across 193 speakers for speech disorder voice reconstruction; identifies evaluation methodology gaps (MOS limited sensitivity) and demonstrates accessibility deployment maturity.

— ACX maintains gatekeeping on independent author AI submissions; only narrator voice replicas (opt-in per-title) permitted; no published timeline for third-party TTS acceptance; documents regulatory and platform friction limiting audiobook TTS adoption velocity.

— Critical adoption barrier signal: AI-narrated audiobooks account for 0.03% of $2.43B market; consumer willingness to try AI voices declined 70%→61% YoY despite technology availability, indicating quality/preference barriers dominate over technical capability.

— NVIDIA containerizes TTS as production microservice (NIM) with published benchmarks: 55–70ms first-chunk latency on L40/H100, sub-100ms inter-chunk on A100; signals infrastructure-layer commoditization of TTS across cloud GPU platforms.

— Comprehensive ecosystem analysis documenting shift from naturalness to expressive control/realtime/local privacy; covers 8+ vendor releases (Microsoft MAI-Voice-2, Google Gemini 3.1, AWS SageMaker, Inworld, Soniox), open-source models, and community pain points (hallucination, dropped words, unnatural turn-taking).

— ElevenLabs enterprise product GA with two-platform architecture (Creative for media, Agents for conversational AI), SOC2/GDPR/HIPAA compliance, 30+ integrations, customer case studies (Klarna 10X resolution, Revolut 31 languages) demonstrating Fortune 500 deployment scale.

HISTORY

  • 2018: Cloud TTS services (Google, Amazon, Microsoft) reach production; transformer-based models approach human quality; early deployment in accessibility and call centres; vendor lock-in and cost remain adoption barriers.
  • 2019: Amazon Polly adds Neural TTS with expressive styles (newscaster, conversational); FastSpeech research demonstrates non-autoregressive speedup for industrial deployment; government and CPaaS platforms adopt neural TTS at scale; production reliability issues surface (SSML timeouts, latency under load).
  • 2020: Major publishers (Washington Post, NYT, Economist) deploy TTS for audio articles with 3x engagement uplift; Azure expands to 206 voices (129 neural); accessibility compliance (Ofcom EPG) drives adoption; research advances controllable prosody and noise robustness; reliability issues persist (service disruptions, integration gaps, emotional expression limitations remain unfixed).
  • 2021: Microsoft Uni-TTSv4 achieves human-parity quality (MOS 4.29); Azure TTS embeds in Outlook, Edge, Word at scale; UniTA innovation reduces pronunciation errors 50%+; enterprise deployments (BBC, Progressive, Swisscom) accelerate via Azure TTS; however, Google Cloud TTS shows 22% SSML failure rate, integration challenges persist (auth, token expiry), and emotional expression limitations block broader use in audiobooks and dubbing.
  • 2022-H1: Azure expands to 129 languages with 36 new preview voices; Interspeech 2022 advances emotional expression via GPT-3 emotion prediction and text-driven style transfer, moving toward solving the frontier limitation. Amazon deploys DEI pronunciation tool at scale. Market growth accelerates (projected $5.61B by 2028). Platform reliability issues continue (Google Speech Services affecting millions, authentication failures in open-source integrations), constraining expansion beyond high-volume enterprise and content publisher workflows.
  • 2022-H2: Microsoft releases contextual voice model (Roger) for long-form content, advancing paragraph-level prosody control for audiobooks and video dubbing. Azure upgrades 400+ voices to 48kHz with HiFiNet2 vocoder for improved fidelity. Google Play Books deploys auto-narration across 8 countries, addressing content creation gap at scale. Research confirms TTS social acceptability (IVA study), emotional perception (EEG study), and deployment viability despite classroom learning performance gaps. Product maturation continues across all major cloud platforms with focus on expressiveness and long-form content quality.
  • 2023-H1: Azure adds low-resource TTS enhancements for accessibility; Google Cloud TTS expands language support. However, adoption slowdown emerges: Speech Technology Magazine reports slower enterprise ramp-up, skill shortages, and vendor interoperability challenges. Google Cloud TTS experiences SSML timepointing regression. Vendor lock-in concerns surface in procurement discussions. Market projected at $5.61B by 2028 (22.5% CAGR) but gap widens between headline growth and actual enterprise adoption, indicating practice plateau despite technical maturity.
  • 2023-H2: AWS launches expressive long-form engine (Polly) with three new voices advancing naturalness for audiobooks; Azure reduces batch TTS pricing 64% (to $0.36/hr) and adds language ID/diarization, driving adoption economics. Research advances emotion control in dialogue systems. Market analysis confirms sustained growth ($3.87B→$7.92B by 2031, 12.66% CAGR; neural TTS 67.18% revenue share, 15.08% CAGR), but technical analysis surfaces persistent challenges: naturalness, emotion control, accent variability, computational limits, and privacy concerns constraining mainstream enterprise adoption beyond accessibility and content publishing workflows.
  • 2024-Q1: Research advances controllable TTS with natural language-guided synthesis (45k-hour datasets) and rhyme-based systems improving naturalness and speed; Azure upgrades Personal Voice zero-shot models. Multi-industry enterprise deployments documented (banking, e-commerce, telecom) with quantified efficiency gains (40% call reduction, 25-second account checks). Independent benchmarking evaluates competitive TTS landscape (Polly Generative ELO 1057.01). Platform maturity continues through research integration with LLMs and controllability innovations, but real-world deployment barriers persist (Azure feature restrictions, cloud service reliability gaps, vendor lock-in).
  • 2024-Q2: Amazon Polly launches generative TTS engine with advanced prosody control (Ruth, Matthew, Amy voices); research advances zero-shot efficiency (VALL-E R: 60% inference reduction) and evaluation methods. Pocket FM demonstrates production-scale TTS adoption via ElevenLabs with 30,000 hours processed at 90% cost reduction, validating economics for high-volume content creation. Community benchmarking tools emerge for vendor comparison. However, Azure TTS reports intermittent latency spikes (5-24 seconds) in production conversational systems, highlighting persistent cloud service reliability barriers. TTS consolidates in high-volume, batch-oriented workflows but real-time interactive applications remain constrained by latency and cost.
  • 2024-Q3: Vendor platform maturation continues: Amazon Polly expands to Czech and Swiss German voices; Microsoft previews HD voices with emotion detection and contextual adaptation using transformer models. BlackHat Labs deploys ElevenLabs TTS for conversational DJ Khaled chatbot with 120k concurrent users, 40% session uplift, and sub-200ms latency, demonstrating TTS viability at enterprise scale for interactive applications. However, open-source TTS integration challenges persist—Hugging Face community reports voice breaks and latency issues in production pipelines. By Q3 end, TTS remains primarily viable for high-volume batch workflows (content production, publishing) with emerging evidence of real-time interactive viability at scale, contingent on vendor reliability improvements.
  • 2024-Q4: Vendor innovation accelerates: Amazon Polly launches 13 new generative voices (6 in early November, 7 in late November) across English, French, Spanish, German, and Italian, with polyglot language switching, expanding generative engine to twenty voices. Research advances multilingual TTS efficiency (FPT AI: 15% latency reduction, 12% WER improvement, 7% emotion accuracy gains). Adoption broadens across sectors: 67% of U.S. K-12 schools use TTS for accessibility, 90% of 2024 vehicles feature voice interfaces, 26% annual audiobook market growth, 68% of European enterprises accelerating adoption for 2025 EAA compliance. Accessibility libraries (DAISY Consortium: 31 services) prioritize TTS for transcription workflows. However, Azure reliability issues persist: multilingual voice synthesis (de-DE-Florian, fr-FR-Remy) failures reported with 400 errors in October-November. By Q4 end, TTS consolidates around vendor platforms for high-volume, regulated (accessibility, automotive), and content-creation workflows, with demonstrated ROI in publishing and customer service, while cloud service reliability and cost economics remain barriers to interactive real-time deployment beyond specialist applications.
  • 2025-Q1: Research continues refining emotional expressiveness: EmoVoice demonstrates LLM-based approach to fine-grained emotion control using phoneme boosting and multimodal evaluation (GPT-4o-audio, Gemini). However, Q1 shows sparse deployment evidence and no major vendor product launches, suggesting market consolidation phase. Voice synthesis remains mature for accessibility and batch workflows; real-time interactive applications show promise but depend on sustained vendor reliability improvements and latency optimization.
  • 2025-Q2: Vendor innovation accelerates despite market maturity: Microsoft launches Azure Neural HD voices (DragonHD/DragonHDOmni with 700+ voices) with emotion detection and sub-300ms latency (April); Telnyx and Genesys integrate Azure HD and Polly respectively, expanding ecosystem breadth. Independent audiobook creators adopt ElevenLabs for commercial production (1-2 books monthly), validating TTS ROI for low-barrier content creation. However, persistent adoption barriers emerge: practitioner analysis highlights Polly's lack of customization and cost inefficiency at scale; ElevenLabs API reliability issues surface (partial outage, April 15). Market analysis projects AI voice agent segment at $2.4B (2024) with 34.8% CAGR through 2034, but TTS capabilities increasingly commoditized. By Q2 end, TTS consolidates around specialized high-volume, accessibility, and content-creation use cases with mature vendor platforms, while cloud service reliability, cost economics, and vendor lock-in remain primary barriers to broader interactive and enterprise deployment.
  • 2025-Q3: Vendor platform maturation accelerates: Amazon Polly expands generative voices to twenty-seven with new polyglot English/French/Polish/Dutch voices (August); research community advances real-time latency optimization through service-oriented architectures. ElevenLabs achieves $200M+ ARR with 41% Fortune 500 penetration and 2M+ conversational agents deployed, demonstrating market consolidation and significant enterprise adoption of voice AI at scale. However, reliability challenges persist: ElevenLabs experiences production incidents (webhook failures July 9, ASR capacity issues September 19), signaling operational pressures as deployment scales. Academic research highlights dual-use risks and ethical gaps in TTS evaluation, identifying deepfakes, training data bias, and misinformation as concerns requiring responsible evaluation frameworks. By Q3 end, TTS demonstrates mature vendor platforms with growing Fortune 500 adoption, but reliability, ethical oversight, and cost economics remain barriers to mainstream enterprise adoption beyond specialized high-volume and accessibility workflows.
  • 2025-Q4: Amazon Polly launches five new generative voices (Austrian German, Irish English, Brazilian Portuguese, Belgian Dutch, Korean) in November, expanding the engine to 31 voices across 20 locales with polyglot capability. Azure expands to 400+ neural voices (140+ languages) including 11 new US English HD voices and an LLM Speech API preview. Independent content creators adopt ElevenLabs for commercial audiobook production at $99/month subscriptions, replacing $2-5k human narration while publishing 1-2 titles monthly. The YouTube voice synthesis market is projected at $5B+ by 2026 with 35%+ CAGR, but user studies show 60%+ viewer preference for authentic narration, signaling authenticity and disclosure barriers. ElevenLabs experiences latency incidents with its Flash v2.5 and Turbo v2.5 models (October 29), and technical analysis reveals persistent production trade-offs: quality (MOS 4.2-4.5) trades against latency (30-54ms), and concurrency limits (8-80 TPS) create scaling barriers. By Q4 end, TTS consolidates around vendor platforms for proven high-volume (content creation, accessibility) workflows with mature economics, but reliability variability, authenticity trust gaps, and latency constraints prevent broader real-time interactive deployment.
  • 2026-Jan: Open-source TTS ecosystem matures with benchmarked models (XTTS v2, CosyVoice 2.0, Fish Speech, F5-TTS) offering vendor-independent alternatives; ElevenLabs demonstrates multilingual educational deployment (UNICEF e-learning), confirming TTS viability for accessibility workflows. Vendor competition intensifies: Azure positions on compliance/regulations (SOC2/HIPAA), ElevenLabs on emotional range and creative applications. Platform infrastructure advances (ElevenLabs agent branching, deployment APIs), enabling enterprise voice agent development; market forecasts voice AI reaching $29.28B by 2026 with 62% of enterprises scaling AI agents. However, production reliability concerns persist: ElevenLabs API disruptions (WebRTC failures, latency), signaling scaling challenges despite platform maturity and broad Fortune 500 adoption.
  • 2026-Feb: Production deployment evidence strengthens: AI integration agencies report quantified business outcomes (22% IVR abandonment reduction, 18% onboarding acceleration, 23% comprehension improvement). Independent benchmark study (10,000 listeners) ranks 20 TTS models with 67% approval rate, showing specialized startups (Minimax, PlayHT, WellSaid Labs) outperforming Big Tech vendors. However, adoption barriers persist: streaming TTS accuracy degrades under load (800ms latency at 100 concurrent streams); vendor lock-in concerns surface with 94% of enterprises worried about platform dependency costs (2.3x-5.7x switching multiples); only 29% willing to pay premium for AI features. Cost-quality tradeoffs shift 2026 market toward efficiency models (Inworld $10 vs ElevenLabs $206 per million characters), indicating price compression and competitive intensification despite TTS technical maturity.
  • 2026-Apr: Peer-reviewed research (JASA, Patti Adank/Han Wang) validates that a quality threshold has been crossed: synthetic voices achieve a 20% intelligibility advantage over human originals in noisy environments, an unexpected result that contradicted the researchers' hypothesis and signals TTS maturing beyond parity to superiority on objective measures. Amazon Science releases BASE TTS research (billion-parameter scale, 100K hours of training data) demonstrating emergent abilities in naturalness not present in smaller models, confirming that scale-driven capability jumps remain active above 500M parameters. Vendor ecosystem consolidates: ElevenLabs is recognized as Google Cloud Applied AI Partner of the Year with documented customer successes (Klarna 10x support resolution, Better.com 2x conversion, Revolut 31 languages), reaching $330M ARR with $100M+ net-new in Q1 2026 and expanding geographically with a Madrid office and named enterprise deployments (MediaMarkt and eDreams deploying millions of multilingual interactions with double-digit resolution improvements). Independent TTS API comparison confirms wide price variance despite MOS quality commoditization (OpenAI $15/M vs ElevenLabs $206/M chars, roughly 14x), with efficiency-first vendors and open-source (Mistral Voxtral Mini; Kokoro at $0.70/M, nearly 300x cheaper than ElevenLabs) intensifying competitive pressure below the premium tier. Adoption barriers persist: streaming accuracy degrades at 100 concurrent streams (800ms latency); vendor lock-in concerns remain (94% of enterprises, 2.3x-5.7x switching costs); only 29% are willing to pay a premium. April end: TTS solidifies at leading-edge with quality validated and deployment proven across multilingual enterprise environments, but reliability and cost economics limit mainstream adoption beyond high-volume and specialized workflows.
  • 2026-May: Deployment evidence accelerates with quantified customer outcomes: Mahindra & Mahindra reports 8% conversion uplift using ElevenLabs voice agents for the XUV 7XO auto launch; Spoonlabs (South Korea) reduces audio novel production from 4-7 months to hours, enabling simultaneous multilingual production across 3 countries. Independent latency benchmarks (Gradium, Coval) show consistent sub-200ms latency is now achievable: Gradium TTS at 155ms P50 (2ms IQR) with 3.3% WER, demonstrating no quality/speed tradeoff. An empirical five-stack comparison (50 trials each) confirmed that only OpenAI Realtime and LiveKit+Gemini Live stay under 300ms P95, establishing that sub-300ms end-to-end conversational latency remains a two-vendor market in practice. Competitive differentiation shifts from basic synthesis to latency consistency and pronunciation accuracy (WER 1-3% variance across vendors). Evaluation methodology gaps surface: expert analysis (Inworld AI) identifies benchmark saturation, metric fragmentation, and lack of standardization in TTS assessment industry-wide. Regulatory risk escalates: a BIPA class action filed against ElevenLabs documents systemic consent violations in TTS model training, with precedent settlements ($100M Google, $2B+ Meta) establishing significant potential liability for vendor data practices. Accessibility evidence strengthens: a peer-reviewed meta-analysis shows a TTS effect on reading comprehension (d=0.35) across 7.5M special education students under US IDEA, validating TAM and educational deployment. Market pricing continues compressing: open-source (Kokoro $0.70/M), efficiency vendors (Inworld $10/M), and the premium tier (ElevenLabs $206/M) span a nearly 300x range despite MOS commoditization. May end: TTS is firmly established at leading-edge with deployment scale and quantified ROI demonstrated, but evaluation standardization gaps, reliability variability under concurrent load, BIPA consent liability, and cost optimization require vendor focus to enable mainstream enterprise adoption beyond high-volume and accessibility workflows.
  • 2026-Jun: The vendor ecosystem continued to stratify around expressiveness and latency: ElevenLabs formalised its two-platform enterprise architecture (Creative and Agents) with SOC2/GDPR/HIPAA compliance and named customer outcomes (Klarna 10x resolution, Revolut 31 languages), while Interspeech 2026 research (FlashTTS) demonstrated 325ms first-packet streaming via multi-token prediction and flow matching, enabling zero-shot cloning without sentence buffering. Infrastructure commoditization accelerated: NVIDIA containerized TTS as a production microservice (NIM) with 55–70ms first-chunk latency across cloud GPU platforms (L40/H100, A100), signaling that TTS maturity has moved from the vendor platform to the cloud infrastructure layer. Independent audiobook platform data (Inkfluence, 5.2x monthly production growth) confirmed economics-driven adoption by independent creators, while Audio Publishers Association market data framed the consumer acceptance ceiling: AI audiobook revenue remains 0.03% of a $2.43B market, with willingness to try AI narration declining year-on-year from 70% to 61%, and 37.5% of consumers citing robotic voice as their top frustration despite MOS commoditisation at 4.2–4.3. Governance barriers hardened: ACX maintained gatekeeping on third-party TTS submissions with no published timeline for independent author acceptance, while the UK government's national-scale MoU with ElevenLabs (Department for Science, Innovation & Technology) demonstrated that public-sector deployment requires multi-model orchestration across 300+ languages with validation routing, indicating that production requirements now exceed single-vendor capability. Interspeech 2026 peer-reviewed research advanced naturalness through non-verbal vocalization support (laughter and sighs with speaker identity preservation: 22.66% speech-NVV EER vs 38.93% baseline) and evaluated 17 TTS systems across 193 speakers for speech disorder voice reconstruction, identifying the limited sensitivity of MOS as an evaluation methodology gap critical to accessibility scaling.
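The per-million-character prices quoted in the entries above can be compared directly. The sketch below uses only those four list prices (Kokoro $0.70, Inworld $10, OpenAI $15, ElevenLabs $206); the function names and the 500,000-character manuscript are hypothetical, and real bills would also depend on plan tiers, volume discounts, and self-hosting costs that are not modelled here.

```python
# List prices in USD per million characters, as quoted in the history entries above.
PRICE_PER_MILLION_CHARS_USD = {
    "Kokoro": 0.70,
    "Inworld": 10.0,
    "OpenAI": 15.0,
    "ElevenLabs": 206.0,
}

def synthesis_cost_usd(vendor: str, characters: int) -> float:
    """Cost of synthesizing the given number of characters at the vendor's list price."""
    return PRICE_PER_MILLION_CHARS_USD[vendor] * characters / 1_000_000

def price_ratio(expensive: str, cheap: str) -> float:
    """How many times more the first vendor charges per character than the second."""
    return PRICE_PER_MILLION_CHARS_USD[expensive] / PRICE_PER_MILLION_CHARS_USD[cheap]

# Hypothetical workload: a 500,000-character manuscript.
MANUSCRIPT_CHARS = 500_000
for vendor in sorted(PRICE_PER_MILLION_CHARS_USD, key=PRICE_PER_MILLION_CHARS_USD.get):
    cost = synthesis_cost_usd(vendor, MANUSCRIPT_CHARS)
    ratio = price_ratio("ElevenLabs", vendor)
    print(f"{vendor:<11} ${cost:>7.2f}   ElevenLabs costs {ratio:.1f}x as much")
```

Ratios like these describe list-price spread only; they say nothing about the quality, latency, or reliability differences that the entries above identify as the remaining basis for differentiation.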