The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.
A daily newsletter distilling the past two weeks of movement in a domain or two — delivered to your inbox while the index updates in the background.
AI that forecasts resource demand and automatically scales infrastructure ahead of load, rather than reactively. Includes predictive scaling based on traffic patterns and business events; distinct from reactive autoscaling which responds to current metrics only.
Predictive autoscaling is a proven, mature infrastructure practice, generally available from every major cloud provider and deeply integrated into the Kubernetes ecosystem. Rather than reacting to CPU or memory spikes, it forecasts demand from historical patterns and provisions capacity ahead of load. CNCF reports 74% enterprise adoption; documented deployments show 22-70% cost reductions and consistent 99.99% availability. Market analysis projects the capacity management market reaching $6.47B by 2030 (24.4% CAGR), driven by AI-powered predictive analytics. The practice has matured from a capability question into an operational reliability discipline.
For AI inference workloads specifically, cold-start latency has emerged as the primary constraint. NetEase Games reduced LLM cold-starts from 42 minutes to 30 seconds via data-path optimization (Fluid/Alluxio caching), showing that autoscaling economics depend on both compute provisioning and model loading speed. Modal's 40x cold-start improvement (2,000s→50s) demonstrates that checkpoint/restore and buffer pooling make scale-to-zero viable for single-GPU inference. IBM Research characterized vLLM startup latency systematically, enabling predictive resource planning.
A critical research finding: RLScale-Bench shows that calibrated rule-based autoscalers outperform deep RL approaches across all workload patterns, redirecting engineering effort toward proper baseline tuning rather than algorithmic novelty. For LLM inference, token-centric observability (Time to First Token, Time per Output Token, queue depth) replaces CPU/memory metrics as the correct scaling signal.
Simple single-service scenarios remain reliable and straightforward. Multi-tier architectures expose harder problems: scaling the wrong bottleneck, thrashing from misconfigured thresholds, forecast blindness to business events, and failed optimization investments when forecastability testing is skipped.
The practice is table-stakes, but operational discipline and correct observability choices separate teams that capture value from those that create new failure modes.
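To make the core idea concrete (forecast demand from historical patterns, then provision ahead of load), here is a minimal sketch in Python. It is not any provider's algorithm: the seasonal-average forecast, the function names, the 20% headroom, and the per-replica throughput figure are all illustrative assumptions.

```python
import math
from statistics import mean


def seasonal_forecast(history, period, horizon):
    """Forecast the next `horizon` samples by averaging matching slots of past periods.

    Assumption: history[0] is aligned to the start of a period (e.g. midnight
    for a daily period), so index % period identifies the slot.
    """
    if period <= 0 or len(history) < period:
        raise ValueError("need at least one full period of history")
    n = len(history)
    forecast = []
    for step in range(horizon):
        slot = (n + step) % period
        # Every earlier observation that fell in the same slot of its period.
        forecast.append(mean(history[slot::period]))
    return forecast


def replicas_for(rps, per_replica_rps, headroom=0.2, min_replicas=1):
    """Replicas needed to serve `rps` with a safety margin (hypothetical policy)."""
    needed = math.ceil(rps * (1 + headroom) / per_replica_rps)
    return max(min_replicas, needed)


def plan(history, period, horizon, per_replica_rps):
    """Replica count to have ready before each forecast interval begins."""
    return [
        replicas_for(rps, per_replica_rps)
        for rps in seasonal_forecast(history, period, horizon)
    ]
```

With a toy history that alternates between 100 and 400 requests per second, `plan([100, 400, 100, 400], period=2, horizon=2, per_replica_rps=50)` asks for 3 replicas ahead of the quiet slot and 10 ahead of the busy one. A real deployment would use a proper forecasting model and would first check that the series is forecastable at all.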
Vendor maturity and cold-start optimization as primary AI focus (June 2026). AWS brought native SageMaker autoscaling for AI model inference endpoints to GA and shipped EKS Auto Mode with quantified performance improvements: node boot 39% faster, scale-out 43% faster (254s→145s for 0-1K pods across 250 nodes), and consolidation 59% faster with 30% more cluster capacity. Google Cloud launched intent-based autoscaling for GKE, with native custom metrics that eliminate external monitoring stacks and a 5x faster reaction time (25s→5s). Google also released GKE standby buffers in June 2026: pre-provisioned nodes that resume 2-3x faster than fresh provisioning, reducing cold-start latency from 4-6 minutes to under 1 minute at P99 with low single-digit percent cost overhead. Azure continues to integrate KEDA into AKS, though VMSS operational reliability gaps persist in production. Specialized cold-start solutions now dominate AI infrastructure work: Run:AI Model Streamer achieved 6x faster LLM loading via Azure Blob streaming (37s vs 225s for a 233GB model); NetEase Games addressed the coupling between autoscaling and the data path by deploying Fluid/Alluxio prefetching alongside compute scaling, reducing 70B-model cold-starts from 42 minutes to 30 seconds; and Modal's checkpoint/restore approach cut GPU startup 40x (2,000s→50s), making scale-to-zero economics viable for single-GPU inference. These operational wins suggest that autoscaling success now depends primarily on data loading speed, not compute provisioning algorithms.
Observability and algorithmic reality corrections (June 2026). IBM Research (MLSys 2026) published the first systematic vLLM cold-start characterization with predictive analytical models, enabling serverless resource planning. A critical algorithmic finding from RLScale-Bench: calibrated rule-based autoscalers outperform deep RL on Kubernetes HPA across all six workload patterns, indicating that engineering effort should focus on proper baseline tuning rather than algorithmic novelty. Controlled testing of Model Predictive Control against HPA reveals a deployment tradeoff: predictive approaches win on sustained load (p99 latency 75.9ms→56.2ms, cost $0.128→$0.023/hr) but lose on short 30-second traffic spikes because of Kubernetes metrics pipeline latency, so effectiveness depends on the workload pattern. KEDA reached v2.20.0 (June 2026), adding a new Elastic Forecast Scaler for predictive autoscaling. For LLM inference workloads, token-centric observability (Time to First Token, Time per Output Token, queue depth) replaces CPU/memory metrics as the correct scaling signal, with practical implementations in KEDA and custom Go controllers. Sberbank deployed Prophet time-series forecasting integrated with KEDA for 5-minute-ahead predictive scaling, along with Project Capacity Policy for dynamic resource reallocation between daytime and nighttime services.
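As an illustration of what scaling on token-centric signals can look like, the sketch below applies the ratio rule used by the Kubernetes HPA (desired = ceil(current × metric / target), taking the largest result across metrics) to queue depth, Time to First Token, and Time per Output Token. The class names, target values, tolerance, and replica cap are hypothetical and do not come from any of the deployments described above.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceSignals:
    queue_depth_per_replica: float  # waiting requests per replica
    ttft_p95_s: float               # Time to First Token, 95th percentile
    tpot_p95_s: float               # Time per Output Token, 95th percentile


@dataclass(frozen=True)
class Targets:
    # Hypothetical service-level targets; tune per model and hardware.
    queue_depth_per_replica: float = 4.0
    ttft_p95_s: float = 1.0
    tpot_p95_s: float = 0.05


def desired_replicas(current, signals, targets, max_replicas=32, tolerance=0.1):
    """HPA-style ratio rule driven by token-centric metrics instead of CPU/memory."""
    ratios = [
        signals.queue_depth_per_replica / targets.queue_depth_per_replica,
        signals.ttft_p95_s / targets.ttft_p95_s,
        signals.tpot_p95_s / targets.tpot_p95_s,
    ]
    worst = max(ratios)  # the most stressed signal drives the decision
    if abs(worst - 1.0) <= tolerance:
        return current  # inside the dead band: avoid threshold oscillation
    return max(1, min(max_replicas, math.ceil(current * worst)))
```

For example, four replicas with a per-replica queue depth of 8 against a target of 4 scale to eight replicas, even when latency metrics still look healthy. The dead band is one simple guard against the thrashing that misconfigured thresholds cause.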
Enterprise capacity crisis and algorithm boundaries emerge from mid-2026 deployments. Cisco's enterprise survey (3,472 IT leaders, March-April 2026) found that 73% expect to hit infrastructure capacity limits within 24 months and that AI traffic is expected to triple (235% growth) in 3 years; 76% acknowledge the need for network upgrades. Hyperscalers have $162B in AI infrastructure projects delayed by power, lead-time, and facility constraints, which establishes capacity planning as a strategic infrastructure chokepoint rather than an optimization problem. Production Kubernetes benchmarking reveals algorithmic boundaries: Karpenter's native consolidation is 43% suboptimal under complex workloads (topology constraints, pod disruption budgets, heterogeneous resource shapes), showing that algorithm effectiveness depends on workload simplicity. Reactive autoscaling timing gaps persist: standard policies (1-minute CloudWatch metrics, 15-second HPA sync, 60-120 second provisioning) systematically fail for step-function traffic spikes and event-driven demand, so known peak patterns require predictive or scheduled scaling. LLM inference workloads change the scaling signal: standard CPU/memory metrics are ineffective (CPU stays flat during inference, GPU memory is pre-allocated), so queue-depth and throughput metrics are required instead. Prerequisites and failure modes remain critical: forecastability testing is mandatory (many teams build models on inherently unforecastable series), and resource request accuracy is a fundamental bottleneck (services declaring 1 CPU/2Gi but consuming <200m CPU/500Mi remain over-provisioned regardless of algorithm quality). Market research projects capacity management reaching $6.47B by 2030 (24.4% CAGR). Case studies document strong ROI where deployment is straightforward: Reco.se achieved a 21% YoY cost reduction via KEDA; AWS platform optimization cut $120k annually; game companies achieve 32-100% service stability with 5x traffic handling.
Operational discipline is required to avoid wrong-bottleneck scaling, threshold oscillation, and over-provisioning in multi-tier orchestration.
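The resource request point above can be checked with simple arithmetic. The sketch below parses a small subset of Kubernetes quantity strings (only the `m`, `Ki`, `Mi`, and `Gi` suffixes) and computes how far a declared request exceeds observed usage. The helper names are hypothetical, and the usage figures are the upper bounds quoted in the text.

```python
# Conversion factors to MiB for the binary suffixes handled here.
_MEMORY_TO_MIB = {"Ki": 1 / 1024, "Mi": 1.0, "Gi": 1024.0}


def cpu_millicores(quantity):
    """Convert a CPU quantity such as '1' or '200m' to millicores."""
    if quantity.endswith("m"):
        return float(quantity[:-1])
    return float(quantity) * 1000.0


def memory_mib(quantity):
    """Convert a memory quantity such as '2Gi' or '500Mi' to MiB."""
    for suffix, factor in _MEMORY_TO_MIB.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor
    raise ValueError(f"unsupported memory quantity: {quantity!r}")


def overprovision_factor(requested, used):
    """How many times larger the request is than observed usage."""
    if used <= 0:
        raise ValueError("observed usage must be positive")
    return requested / used


# The example from the text: 1 CPU / 2Gi declared, at most 200m / 500Mi consumed.
cpu_factor = overprovision_factor(cpu_millicores("1"), cpu_millicores("200m"))
mem_factor = overprovision_factor(memory_mib("2Gi"), memory_mib("500Mi"))
```

Here the CPU request is at least 5x observed usage and the memory request is roughly 4.1x. Any autoscaler that sizes nodes from requests inherits that inflation, which is why request accuracy matters before algorithm choice.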
— AWS EKS Auto Mode Karpenter improvements: node boot 39% faster (13s), scale-out 43% faster (254s→145s for 0-1K pods), consolidation 59% faster with 30% more capacity. Production GA autoscaling performance gains on m5.xlarge clusters.
— LG AI Research production case: queue-depth and throughput metrics replace CPU/memory for vLLM autoscaling; identified 52 idle GPUs in nighttime off-peaks, scheduled training tasks to utilize spare capacity without expanding infrastructure.
— Cast AI production benchmark: Karpenter consolidation 43% suboptimal under complex workloads (topology constraints, PDBs, heterogeneous pods). Native consolidation works on clean workloads but reveals algorithm limits in production scenarios.
— Data Centre Digest analysis: $162B in AI projects delayed by capacity/power constraints. Documents GPU-specific challenges (30-200kW rack densities vs. 8-12kW baseline) and recommends horizontal and vertical pod autoscaling (HPA/VPA) with hybrid multicloud for AI workloads.
— Reactive autoscaling timing gap: 1-min CloudWatch, 15s HPA sync, 60-120s provisioning = capacity arrives after spike completes. Demonstrates predictive/scheduled scaling prerequisites for event-driven traffic; confirms threshold-based reactivity insufficient for step-function demand.
— Cisco survey of 3,472 IT leaders: 73% expect capacity limits within 24 months, AI will triple network traffic (235% in 3 years), 80% of AI adopters report workloads critically sensitive to reliability. Establishes urgent enterprise capacity planning imperative from agentic AI.
— Named client Reco.se (Swedish review platform) achieved 21% YoY cost reduction via KEDA event-driven autoscaling on GCP/Kubernetes with OpenTelemetry observability and resource allocation optimization based on actual usage patterns.
— Industry comparison documents ecosystem maturity in AI/ML inference: GPU-aware autoscaling essential, queue-based scaling replaced CPU-only, continuous batching improved throughput, Kubernetes-native dominates, predictive autoscaling expanded beyond traditional IT ops.
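The reactive timing gap noted in the list above is easy to quantify. The sketch below adds up the stage delays quoted there (1-minute CloudWatch period, 15-second HPA sync, 60-120 second provisioning) and compares the total with a spike's duration. It assumes each stage incurs its full interval, which is a pessimistic simplification, and the class and method names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReactivePipeline:
    # Defaults are the figures quoted in the text.
    metric_period_s: float = 60.0      # CloudWatch 1-minute metrics
    hpa_sync_s: float = 15.0           # HPA sync interval
    provisioning_min_s: float = 60.0   # fastest node provisioning
    provisioning_max_s: float = 120.0  # slowest node provisioning

    def capacity_delay_bounds(self):
        """(low, high) seconds from spike onset until new capacity is ready."""
        fixed = self.metric_period_s + self.hpa_sync_s
        return fixed + self.provisioning_min_s, fixed + self.provisioning_max_s

    def covered_fraction(self, spike_duration_s):
        """Share of a spike served by the added capacity, using the slow bound."""
        if spike_duration_s <= 0:
            raise ValueError("spike duration must be positive")
        _, worst = self.capacity_delay_bounds()
        return max(0.0, (spike_duration_s - worst) / spike_duration_s)
```

Under these assumptions new capacity arrives 135 to 195 seconds after the spike begins, so a two-minute step-function spike is over before any of it is usable, while a ten-minute spike is covered for about two thirds of its length in the slow case.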