Perly Consulting │ Beck Eco

The State of Play

A living index of AI adoption across industries — where established practice meets the bleeding edge
UPDATED DAILY

The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.

AI Maturity by Domain

[Chart: each dot marks the weighted maturity of practices within a domain, on an axis from Bleeding Edge to Established.]

Capacity planning & predictive autoscaling

GOOD PRACTICE

TRAJECTORY

Advancing

AI that forecasts resource demand and automatically scales infrastructure ahead of load, rather than reactively. Includes predictive scaling based on traffic patterns and business events; distinct from reactive autoscaling which responds to current metrics only.
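
As a rough illustration of the distinction above, the sketch below contrasts a reactive rule with a predictive one. The reactive function follows the replica formula documented for the Kubernetes Horizontal Pod Autoscaler. The predictive function uses a seasonal-naive forecast, which is a stand-in chosen for brevity and not the method of any vendor named on this page. The function names, the traffic series, and the per-replica throughput figure are all hypothetical.

```python
import math


def reactive_replicas(current_replicas: int, current_util: float, target_util: float) -> int:
    """HPA-style rule: scale on the metric observed right now."""
    return max(1, math.ceil(current_replicas * current_util / target_util))


def predictive_replicas(history_rps, period, lead_steps, rps_per_replica):
    """Seasonal-naive forecast (an illustrative assumption): expect the next
    `lead_steps` samples to repeat what was seen one `period` ago, then
    provision for the forecast peak."""
    if lead_steps < 1 or lead_steps > period:
        raise ValueError("lead_steps must be between 1 and period")
    if len(history_rps) < period:
        raise ValueError("need at least one full period of history")
    start = len(history_rps) - period
    forecast = history_rps[start:start + lead_steps]
    return max(1, math.ceil(max(forecast) / rps_per_replica))


# Hypothetical traffic in requests/second, sampled six times per cycle.
# The spike (400 rps) recurs at the same point in every cycle.
history = [100, 100, 400, 400, 100, 100, 100, 100]

# Reactive: utilization is on target right now, so nothing changes.
print(reactive_replicas(current_replicas=2, current_util=0.5, target_util=0.5))

# Predictive: the forecast sees the recurring spike two samples ahead.
print(predictive_replicas(history, period=6, lead_steps=2, rps_per_replica=50))
```

With these numbers the reactive rule holds at 2 replicas while the predictive rule asks for 8 before the spike arrives. A production forecaster would need to handle trend, noise, and business events, which this sketch ignores.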

OVERVIEW

Predictive autoscaling is a proven, mature infrastructure practice, generally available from every major cloud provider and deeply integrated into the Kubernetes ecosystem. Rather than reacting to CPU or memory spikes, it forecasts demand from historical patterns and provisions capacity ahead of load. CNCF reports 74% enterprise adoption; documented deployments show 22-70% cost reductions and consistent 99.99% availability. Market analysis projects the capacity management market reaching $6.47B by 2030 (24.4% CAGR), driven by AI-powered predictive analytics. The practice has matured from a question of capability into a discipline of operational reliability. For AI inference workloads specifically, cold-start latency has emerged as the primary constraint: NetEase Games reduced LLM cold-starts from 42 minutes to 30 seconds via data-path optimization (Fluid/Alluxio caching), showing that autoscaling economics depend on both compute provisioning and model loading speed. Modal's 40x cold-start improvement (2,000s→50s) demonstrates that checkpoint/restore and buffer pooling make scale-to-zero viable for single-GPU inference. IBM Research systematically characterized vLLM startup latency, enabling predictive resource planning. A critical research finding: RLScale-Bench shows that calibrated rule-based autoscalers outperform deep RL approaches across all workload patterns, redirecting engineering effort toward proper baseline tuning rather than algorithmic novelty. For LLM inference, token-centric observability (Time to First Token, Time per Output Token, queue depth) replaces CPU/memory metrics as the correct scaling signal. Simple single-service scenarios remain reliable and straightforward. Multi-tier architectures expose harder problems: scaling the wrong bottleneck, thrashing from misconfigured thresholds, and forecasts that are blind to business events. Forecastability testing is a prerequisite for avoiding failed optimization investments.
The practice is table-stakes, but operational discipline and correct observability choices separate teams that capture value from those that create new failure modes.
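
The forecastability prerequisite mentioned above can be approximated with a very small check. The sketch below is one possible diagnostic, not the specific technique the cited assessments recommend: it compares the error of a seasonal-naive forecast with the error of a last-value forecast over the same history. The function names and the sample series are hypothetical.

```python
def mean_abs_error(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def forecastability_ratio(series, period):
    """Seasonal-naive MAE divided by last-value MAE.

    Values well below 1.0 suggest a repeating pattern worth modelling;
    values near or above 1.0 suggest the assumed seasonality adds nothing.
    """
    if period < 1 or len(series) <= period:
        raise ValueError("series must be longer than one period")
    actual = series[period:]
    seasonal_guess = series[:-period]         # value one period earlier
    last_value_guess = series[period - 1:-1]  # value one step earlier
    seasonal_mae = mean_abs_error(actual, seasonal_guess)
    baseline_mae = mean_abs_error(actual, last_value_guess)
    if baseline_mae == 0:
        return 0.0 if seasonal_mae == 0 else float("inf")
    return seasonal_mae / baseline_mae


# A cleanly repeating load pattern: the seasonal guess is exact.
print(forecastability_ratio([10, 50, 10, 50, 10, 50], period=2))

# A steady ramp with no cycle: the seasonal guess is worse than the baseline.
print(forecastability_ratio([1, 2, 3, 4, 5, 6], period=3))
```

A ratio this simple only tests one assumed period. It is meant as a cheap gate before investing in a forecasting model, under the assumption that a series which cannot beat a trivial baseline is unlikely to reward a more elaborate one.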

CURRENT LANDSCAPE

Vendor maturity and cold-start optimization as primary AI focus (June 2026). AWS released SageMaker native autoscaling for AI model inference endpoints in GA and EKS Auto Mode with quantified performance improvements: node boot 39% faster, scale-out 43% faster (254s→145s for 0-1K pods across 250 nodes), consolidation 59% faster with 30% more cluster capacity. Google Cloud launched intent-based autoscaling for GKE with native custom metrics eliminating external monitoring stacks and 5x faster reaction time (25s→5s). Google also released GKE standby buffers in June 2026 (pre-provisioned nodes that resume 2-3x faster than fresh provisioning), reducing cold-start latency from 4-6 minutes to <1 minute at P99 with low single-digit percent cost overhead. Azure continues KEDA integration into AKS, though VMSS operational reliability gaps persist in production. Specialized cold-start solutions now dominate AI infrastructure: Run:AI Model Streamer achieved 6x faster LLM loading via Azure Blob streaming (37s vs 225s for 233GB model); NetEase Games solved the autoscaling-data-path coupling by deploying Fluid/Alluxio prefetching alongside compute scaling, reducing 70B-model cold-starts from 42 minutes to 30 seconds. Modal's checkpoint/restore approach cut GPU startup 40x (2,000s→50s), proving scale-to-zero economics viable for single-GPU inference. These operational wins establish that autoscaling success now depends primarily on data loading speed, not compute provisioning algorithms.

Observability and algorithmic reality corrections (June 2026). IBM Research (MLSys 2026) published the first systematic vLLM cold-start characterization with predictive analytical models, enabling serverless resource planning. A critical algorithmic finding from RLScale-Bench: calibrated rule-based autoscalers outperform deep RL on Kubernetes HPA across all six workload patterns, indicating that engineering effort should focus on proper baseline tuning rather than algorithmic novelty. Controlled testing of Model Predictive Control versus HPA reveals a deployment tradeoff: predictive approaches win on sustained load (p99 latency 75.9ms→56.2ms, cost $0.128→$0.023/hr) but lose on short 30-second traffic spikes because of Kubernetes metrics pipeline latency, so effectiveness depends on the workload pattern. KEDA reached v2.20.0 (June 2026) with a new Elastic Forecast Scaler for predictive autoscaling. For LLM inference workloads, token-centric observability (Time to First Token, Time per Output Token, queue depth) replaces CPU/memory metrics as the correct scaling signal, with practical implementations in KEDA and custom Go controllers. Sberbank deployed Prophet time-series forecasting integrated with KEDA for 5-minute-ahead predictive scaling, along with Project Capacity Policy for dynamic resource reallocation between daytime and nighttime services.
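
To make the token-centric scaling signal concrete, here is a minimal sketch of a replica decision driven by queue depth and Time to First Token instead of CPU or memory. It does not use the KEDA API or any vendor's controller logic. The dataclass, function name, thresholds, and replica bounds are all hypothetical values chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class InferenceMetrics:
    queue_depth: int    # requests waiting across all replicas
    ttft_p95_s: float   # 95th percentile Time to First Token, in seconds
    replicas: int       # replicas currently serving


def desired_replicas(m: InferenceMetrics,
                     target_queue_per_replica: int,
                     ttft_slo_s: float,
                     min_replicas: int = 1,
                     max_replicas: int = 16) -> int:
    """Pick a replica count from queue depth, with a latency safety net."""
    # Enough replicas to keep each one's share of the queue at the target.
    by_queue = math.ceil(m.queue_depth / target_queue_per_replica)
    # If the TTFT objective is breached, add a replica even when the queue
    # looks acceptable (assumed policy, not taken from the source).
    by_latency = m.replicas + 1 if m.ttft_p95_s > ttft_slo_s else min_replicas
    return min(max_replicas, max(min_replicas, by_queue, by_latency))


backlog = InferenceMetrics(queue_depth=45, ttft_p95_s=0.8, replicas=3)
slow = InferenceMetrics(queue_depth=5, ttft_p95_s=2.5, replicas=3)

print(desired_replicas(backlog, target_queue_per_replica=10, ttft_slo_s=1.0))
print(desired_replicas(slow, target_queue_per_replica=10, ttft_slo_s=1.0))
```

In the first case the queue alone calls for 5 replicas; in the second the queue is short but the latency breach pushes the count from 3 to 4. CPU utilization appears nowhere in the decision, which is the point the paragraph above makes.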

Enterprise capacity crisis and algorithm boundaries emerge from mid-2026 deployments. Cisco's enterprise survey (3,472 IT leaders, March-April 2026) found that 73% expect to hit infrastructure capacity limits within 24 months and that AI traffic will triple (235% growth) in 3 years; 76% acknowledge the need for network upgrades. Hyperscalers have $162B in AI infrastructure projects delayed by power, lead-time, and facility constraints, establishing capacity planning as a strategic infrastructure chokepoint rather than an optimization problem. Production Kubernetes benchmarking reveals algorithmic boundaries: Karpenter's native consolidation is 43% suboptimal under complex workloads (topology constraints, pod disruption budgets, heterogeneous resource shapes), showing that algorithm effectiveness depends on workload simplicity. Reactive autoscaling timing gaps persist: standard policies (1-minute CloudWatch metrics, 15-second HPA sync, 60-120 second provisioning) systematically fail for step-function traffic spikes and event-driven demand, so known peak patterns require predictive or scheduled scaling. LLM inference workloads need a different signal: standard CPU/memory metrics are ineffective (CPU stays flat during inference and GPU memory is pre-allocated), so scaling must use queue-depth and throughput metrics instead. Prerequisites and failure modes remain critical: forecastability testing is mandatory (many teams build models on inherently unforecastable series), and resource request accuracy is a fundamental bottleneck (services declaring 1 CPU/2Gi but consuming <200m CPU/500Mi remain over-provisioned regardless of algorithm quality). Market research projects capacity management reaching $6.47B by 2030 (24.4% CAGR). Case studies document strong ROI where deployment is straightforward: Reco.se achieved a 21% YoY cost reduction via KEDA; an AWS platform optimization cut $120k annually; a Korean game company handled 5x traffic with 32% cost savings.
Operational discipline is required to avoid wrong-bottleneck scaling, threshold oscillation, and over-provisioning in multi-tier orchestration.
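
The reactive timing gap described above is simple arithmetic, sketched here with the delays quoted in the text. Treating the three stages as strictly sequential is an assumption that gives a rough upper bound, and the function names are hypothetical. The 30-second spike length is borrowed from the Model Predictive Control test mentioned earlier on this page.

```python
def reactive_lag_seconds(metric_period_s: int, controller_sync_s: int, provisioning_s: int) -> int:
    """Rough upper bound on time from load change to usable capacity,
    assuming the stages run one after another."""
    return metric_period_s + controller_sync_s + provisioning_s


def capacity_arrives_in_time(spike_duration_s: int, lag_s: int) -> bool:
    """True only if new capacity is ready before the spike is over."""
    return lag_s < spike_duration_s


# Figures from the text: 1-minute CloudWatch metrics, 15-second HPA sync,
# and 60-120 seconds of node provisioning.
best_case = reactive_lag_seconds(60, 15, 60)
worst_case = reactive_lag_seconds(60, 15, 120)
print(best_case, worst_case)

# A 30-second step-function spike ends long before capacity shows up.
print(capacity_arrives_in_time(spike_duration_s=30, lag_s=best_case))
```

Even the best case here is 135 seconds, more than four times the length of the spike, which is why the text points to predictive or scheduled scaling for known peaks.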

TIER HISTORY

Research: Jan-2018 → Jan-2018
Bleeding Edge: Jan-2018 → Jul-2022
Leading Edge: Jul-2022 → Oct-2024
Good Practice: Oct-2024 → present

EVIDENCE (149)

— AWS EKS Auto Mode Karpenter improvements: node boot 39% faster (13s), scale-out 43% faster (254s→145s for 0-1K pods), consolidation 59% faster with 30% more capacity. Production GA autoscaling performance gains on m5.xlarge clusters.

— LG AI Research production case: queue-depth and throughput metrics replace CPU/memory for vLLM autoscaling; identified 52 idle GPUs in nighttime off-peaks, scheduled training tasks to utilize spare capacity without expanding infrastructure.

— Cast AI production benchmark: Karpenter consolidation 43% suboptimal under complex workloads (topology constraints, PDBs, heterogeneous pods). Native consolidation works on clean workloads but reveals algorithm limits in production scenarios.

— Data Centre Digest analysis: $162B in AI projects delayed by capacity/power constraints. Documents GPU-specific challenges (30-200kW rack densities vs. 8-12kW baseline) and recommends horizontal scaling (HPA/VPA) with hybrid multicloud for AI workloads.

— Reactive autoscaling timing gap: 1-min CloudWatch, 15s HPA sync, 60-120s provisioning = capacity arrives after spike completes. Demonstrates predictive/scheduled scaling prerequisites for event-driven traffic; confirms threshold-based reactivity insufficient for step-function demand.

— Cisco survey of 3,472 IT leaders: 73% expect capacity limits within 24 months, AI will triple network traffic (235% in 3 years), 80% of AI adopters report workloads critically sensitive to reliability. Establishes urgent enterprise capacity planning imperative from agentic AI.

— Named client Reco.se (Swedish review platform) achieved 21% YoY cost reduction via KEDA event-driven autoscaling on GCP/Kubernetes with OpenTelemetry observability and resource allocation optimization based on actual usage patterns.

— Industry comparison documents ecosystem maturity in AI/ML inference: GPU-aware autoscaling essential, queue-based scaling replaced CPU-only, continuous batching improved throughput, Kubernetes-native dominates, predictive autoscaling expanded beyond traditional IT ops.

HISTORY

  • 2018: AWS launches Predictive Scaling for EC2 in general availability, introducing machine learning-based capacity forecasting to mainstream cloud infrastructure. Major vendors begin shipping predictive autoscaling as a core platform capability.
  • 2019: Predictive autoscaling enters production at scale—AWS validates the approach during Prime Day with massive server equivalent scaling. Ecosystem matures with Citrix shipping VDI autoscaling. However, production failures emerge: cloud provider prediction logic fails under edge cases (Azure scale-in blocking due to false memory spike predictions; Kubernetes spot instance capacity conflicts). Academic research finds existing self-aware autoscaling systems remain unreliably deployable.
  • 2020: Broad adoption accelerates across cloud and container orchestration platforms. Google Cloud releases scale-in controls; AWS demonstrates ML-based predictive scaling in game services (GameServer Autopilot using SageMaker RL); independent financial services (Monzo) publish Kubernetes autoscaling case studies. Research continues advancing RL approaches. Operational challenges remain widespread: CodeDeploy integration failures cause infinite scale-in/out loops; parameter tuning issues lead to erratic scaling; predictive models fail on traffic anomalies.
  • 2021: Predictive autoscaling transitions from specialized feature to mainstream platform capability. AWS moves predictive scaling into native EC2 Auto Scaling policy (May 2021), improving accessibility; extends support to custom application metrics by November 2021. Google Cloud launches predictive autoscaling for Compute Engine in preview (March 2021), validating vendor convergence around the approach. Research papers on OpenStack and neural network approaches advance the algorithmic foundations. Operational maturity gaps persist despite wider adoption.
  • 2022-H1: AWS releases predictive scaling backfill feature (May 2022), enabling retroactive forecast validation. Kubernetes ecosystem matures with KEDA integration guides from AWS and practitioner deployments demonstrating latency reduction in production. However, Azure VMSS scale failures in early 2022 reaffirm that integration reliability remains inconsistent across vendors. Adoption accelerates but operational challenges persist.
  • 2022-H2: Cloud provider expansion continues with AWS rolling out predictive scaling to Jakarta (October), while Azure prepares GA of native predictive autoscaling. Kubernetes ecosystem deepens with specialized tooling—Avesha launches Smart Scaler (October), a vendor-specific HPA product using RL. Academic research advances with graph neural networks and energy-efficient multi-resource prediction frameworks being validated. Ecosystem shows broad adoption with persistent reliability gaps.
  • 2023-H1: Predictive autoscaling becomes normalized as an expected platform feature across all major providers. AWS improves EC2 forecast frequency from daily to 4x daily (January 2023), reducing forecast windows from 24h to 6h for better responsiveness. AWS adds console-based activation recommendations to reduce adoption friction. Azure VMSS native predictive autoscaling reaches general availability with 7-day minimum training data requirement. Alibaba Cloud publishes VLDB 2023 research on hyperscale predictive autoscaling deployment (MagicScaler), validating enterprise production viability. Kubernetes practitioners demonstrate GPU and ML workload optimization with KEDA. Adoption has moved from "competitive advantage" to "table-stakes infrastructure"; operational reliability remains the primary constraint.
  • 2023-H2: Platform vendors continue ecosystem expansion: Microsoft ships KEDA as native add-on for Azure Kubernetes Service (November), streamlining event-driven and predictive scaling integration. AWS enhances autoscaling reliability with instance refresh rollback controls via CloudWatch alarms (August), enabling proactive failure detection. However, operational challenges persist: Pivotal Cloud Foundry experiences metric accuracy failures causing autoscaling thrashes; Azure VMSS suffers regression in ephemeral OS disk handling affecting autoscaling; AWS Aurora encounters 'anticipated flapping' algorithm limitations blocking scale-in. Academic research validates predictive autoscaling feasibility with 99% performance improvements. Landscape reflects matured platforms with well-documented reliability gaps rather than capability gaps.
  • 2024-Q1: Predictive autoscaling enters steady-state maturity as expected platform capability across cloud and Kubernetes. AWS optimizes Windows workloads (78% scale-out time reduction via EC2 Image Builder), Azure continues with DaaS integration (Citrix Autoscale Insights preview), and academic research advances self-adaptive microservice approaches (50% CPU savings vs HPA, GRU-based VNF prediction at 98% accuracy). Platform fragmentation shifts focus from feature parity to operational reliability—vendors offer similar core capabilities but diverge significantly on failure recovery, parameter tuning complexity, and behavior during traffic anomalies. Adoption is universal; the practice is now table-stakes infrastructure rather than competitive advantage.
  • 2024-Q2: Cloud platforms complete Kubernetes integration: Azure KEDA reaches GA in Portal (May 2024) and native AKS, with AWS/Microsoft publishing vendor-specific tutorials demonstrating ecosystem maturity. AWS Well-Architected Framework officially endorses predictive scaling (June 2024), cementing adoption as table-stakes. Academic innovation continues: BIAS Autoscaler achieves 25% cost reduction via burstable instances, advancing algorithm approaches. However, production reliability gaps surface in canary deployments: Argo Rollouts documents 30-second service disruption windows during dynamic scaling (June 2024), while Azure VMSS continues experiencing multi-hour scaling delays. The practice remains mature and widely adopted, but real-world deployment challenges in complex orchestration scenarios reveal the operational maturity gap between simple and sophisticated use cases.
  • 2024-Q3: Predictive autoscaling market expands at 13.2% CAGR; industry adoption reaches $407B and growing. Academic research identifies novel failure modes: PREFACE framework (FSE 2024) reveals autoscaling introduces previously undetected failure patterns in distributed applications, requiring specialized prediction techniques. Comprehensive review in Sensors journal underscores persisting challenges in ML-based forecasting. Practitioner guidance increasingly acknowledges over-engineering risks—real-world deployments show failures from database DoS due to uncoordinated scaling, slow boot times, and threshold misconfiguration. Azure VMSS operational reliability remains inconsistent: official troubleshooting guidance documents flapping thresholds, diagnostic extension failures, and Flex VM scale-set delays. Signal balance: platform maturity and universal adoption are uncontested, but deployment specialists emphasize that operational complexity grows with orchestration sophistication; simple use cases remain reliable while multi-layered scenarios reveal brittle failure modes.
  • 2024-Q4: Platform maturity solidifies: AWS extends predictive scaling to ECS (November 2024), MongoDB demonstrates production research on Atlas vertical scaling, vendor ecosystem remains robust with Alibaba/Azure/Grafana production deployments via KEDA. However, operational reality persists—KEDA discussions surface multi-minute scaling delays, revealing that even mature platforms encounter latency challenges at scale.
  • 2025-Q1: Platform investment continues across vendors: Microsoft publishes updated KEDA integration guides for Azure AKS (March 2025); AWS and practitioners publish multi-service optimization tutorials; community discussion highlights fundamental limitations of predefined metrics (lagging indicators, over-simplification), reflecting ecosystem maturity focused on operational refinement rather than capability expansion. Adoption remains universal and table-stakes; focus shifts entirely to reliable deployment patterns in complex scenarios.
  • 2025-Q2: Vendor expansion into specialized workloads: Oracle Cloud launches GA custom metrics autoscaling for AI model deployments (April 2025); KServe publishes KEDA integration tutorials for inference service scaling (May 2025). Academic validation: Aalborg University research demonstrates predictive autoscaler reducing response time 14-20% and high-latency requests 93-95% vs reactive HPA (June 2025). Critical practitioner assessments surface persistent limitations: complexity across layered systems, reactivity windows despite forecasting, and cost risks from over-provisioning. Ecosystem remains mature and universally adopted; operational reliability in sophisticated scenarios continues as primary constraint.
  • 2025-Q3: Vendor ecosystem consolidation and validation: Citrix announces VDI predictive autoscaling analysis tooling in GA (September 2025), expanding vendor breadth beyond cloud-native. Practitioner case studies show strong ROI: AWS deployments achieve 70% cost reduction and 99.99% availability using predictive policies (August 2025). ML/AI workload orchestration deepens: technical guides demonstrate KEDA-based autoscaling for GPU-intensive inference with cost optimization (August 2025). Critical assessment identifies persistent failure modes: reactive latency despite forecasting, wrong-target scaling when bottlenecks are downstream, thrashing from misconfiguration, and business-context blindness. Platform maturity is uncontested; deployment complexity in multi-layer scenarios remains primary operational constraint.
  • 2025-Q4: Capability expansion beyond traditional compute: Grab deploys ML predictive autoscaling for Flink stream processing (October 2025), addressing 2.5x app growth through CPU forecasting; AWS expands predictive scaling to 6 new regions signaling continued investment; CNCF analysis highlights persistent performance/reliability/cost trade-offs in Kubernetes autoscaling with KEDA and Karpenter; production maturity guides document stable HPA adoption patterns. Ecosystem remains mature and universal; vendor and practitioner focus shifts entirely to reliable deployment in layered systems and specialized workload categories.
  • 2026-Jan: Operational reliability challenges persist despite ecosystem maturity: Azure Kubernetes Service autoscaling failures documented in production, with node pool scaling stuck due to quota limits, capacity constraints, or subnet IP exhaustion. Platform foundation remains stable for simple use cases; complex multi-layered orchestration continues to require careful failure recovery planning and manual intervention fallbacks.
  • 2026-Feb: Ecosystem maturity confirmed with production evidence and novel research directions: Calendly demonstrates predictive HPA deployment using Datadog time-shifted metrics eliminating hourly traffic spike latency; Aalborg University research validates 34.68% energy reduction via NeuroScaler; CNCF reports 74% enterprise adoption with Q1 2026 fintech case study showing 22% cost reduction via Karpenter. Google Cloud and practitioners publish deployment guides. Critical assessment notes persistent lack of public case studies despite ecosystem technical maturity, highlighting documentation gap. Platform capability remains table-stakes; deployment complexity in multi-layered scenarios remains primary constraint.
  • 2026-Apr: Evidence confirmed predictive autoscaling's expanding role in AI inference workloads while documenting persistent barriers specific to that domain. Simplismart.ai achieved 60-70 second GPU inference scale-up using EC2 warm pools, down from 5-6 minutes previously, validating the warm pool strategy for AI workloads. NVIDIA's Dynamo SLA Planner extended ML-based capacity forecasting (ARIMA, Kalman filter, Prophet) to GPU infrastructure, and Amazon published ACM SoCC research on ensemble forecasting algorithms for Redshift Serverless. Blizzard published an operational Kubernetes autoscaling playbook for predictable game-launch spikes. A KubeCon EU practitioner analysis identified structural adoption barriers for AI inference: GPU cold-start delays of 30-120 seconds make reactive HPA inadequate, and token latency variation undermines threshold-based scaling—reinforcing that predictive pre-scaling is now the minimum viable approach for LLM serving infrastructure. Enterprise production deployments show consistent ROI: MongoDB's experiment on 10K production clusters validated cost savings at 9 cents/hour per replica set; Grab achieved 55% infrastructure cost reduction via KEDA on Kafka consumers (CPU utilization 15%→57% while maintaining SLA and zero data loss); Uber operates Capacity Recommendation Engine for thousands of microservices. KEDA reached CNCF Graduated status (August 2023), cementing event-driven autoscaling as an industry-standard capability endorsed by vendor-neutral governance. A critical practitioner assessment identified a fundamental modeling risk: many teams build forecasting models on inherently unforecastable time series, with diagnostic techniques to test forecastability before modeling now recommended as a pre-deployment prerequisite.
  • 2026-May: Ecosystem maturation confirmed with continued vendor investment and industry-wide capacity constraints emerging as primary bottleneck. Google Cloud (April 2026) launched intent-based autoscaling for GKE with 5x faster reaction time (25s→5s) and native custom metrics, eliminating external observability stack dependencies. AWS SageMaker native autoscaling for AI model inference endpoints reached GA, confirming hyperscaler commitment to AI-specific capacity planning as a mainstream feature. Cast AI and Zesty expanded coordinated HPA/VPA optimization — Zesty platform GA demonstrates 40% cluster optimization and 10% size reduction at production scale. KEDA queue-depth scaling for vLLM inference achieved 40% GPU spend reduction and 60% p99 latency improvement, validating event-driven scaling over CPU-metric approaches for AI workloads. Cold-start optimization emerged as the dominant AI autoscaling challenge: NetEase Games reduced 70B-model cold-starts from 42 minutes to 30 seconds via Fluid/Alluxio data-path caching; Run:AI Model Streamer achieved 6x faster LLM loading from Azure Blob (37s vs 225s for a 233GB model); Modal cut GPU inference cold starts 40x (2,000s→50s) across 15M production restores via checkpoint/restore and buffer pooling; IBM Research (MLSys 2026) published the first systematic vLLM cold-start characterization with an analytical predictive model enabling serverless resource planning. RLScale-Bench research found calibrated rule-based autoscalers outperform deep RL on Kubernetes HPA across all workload patterns, redirecting engineering effort toward baseline tuning. Sberbank deployed Prophet time-series forecasting integrated with KEDA for 5-minute-ahead predictive scaling with dynamic resource reallocation. Critical signal: Datadog production analysis established capacity limits as the primary operational bottleneck — 60% of AI request failures directly caused by capacity constraints across thousands of production systems. 
    Market analysis projects capacity management reaching $6.47B by 2030 (24.4% CAGR) driven by AI-powered predictive analytics. Practitioner assessments emphasize that simple deployments remain reliable while multi-layered scenarios require operational discipline; feedback-loop engineering costs and forecastability testing prerequisites remain significant implementation barriers.
  • 2026-Jun: Vendor innovation and deployment insights confirmed continued ecosystem evolution while an enterprise capacity crisis emerged as a new strategic dimension. Google Cloud launched GKE standby buffers (pre-provisioned nodes resuming 2-3x faster), reducing cold-start latency from 4-6 minutes to under 1 minute P99 with minimal cost overhead; KEDA reached v2.20.0 with a new Elastic Forecast Scaler expanding predictive autoscaling into the event-driven orchestration ecosystem; AWS EKS Auto Mode released quantified Karpenter improvements (node boot 39% faster at 13s, scale-out 43% faster at 254s→145s for 0-1K pods, consolidation 59% faster). A Cisco enterprise survey (3,472 IT leaders) found 73% expect infrastructure capacity limits within 24 months, with AI set to triple network traffic (235% in 3 years) — elevating capacity planning from optimization problem to strategic infrastructure chokepoint. Controlled testing revealed workload-pattern dependence: Model Predictive Control outperforms HPA on sustained load (cost $0.128→$0.023/hr) but loses on 30-second spikes due to Kubernetes metrics pipeline latency; LG AI Research's production case established that queue-depth and throughput metrics replace CPU/memory as the correct scaling signal for vLLM inference, identifying 52 idle GPUs during nighttime off-peaks usable for training without expanding infrastructure. Production case studies confirmed sustained ROI (Reco.se 21% YoY cost reduction via KEDA; AWS optimization cut $120k annually; Korean game company achieved 5x traffic handling with 32% cost savings), while a Cast AI 7-day Karpenter consolidation benchmark found 43% suboptimal performance under complex workloads (topology constraints, pod disruption budgets, heterogeneous resource shapes) — confirming that algorithm effectiveness depends heavily on workload simplicity. 
    Critical bottleneck crystallised: autoscaler effectiveness is fundamentally limited by the accuracy of resource request declarations. Teams declaring 1 CPU/2Gi RAM but consuming under 200m CPU/500Mi RAM remain over-provisioned regardless of algorithm quality.
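
The resource-request example above can be worked through numerically. This sketch parses only a small subset of Kubernetes quantity notation (plain cores, `m`, `Mi`, `Gi`), and the helper names are hypothetical. It shows how far the declared requests sit above the usage figures quoted in the text.

```python
def cpu_millicores(quantity: str) -> float:
    """Convert a CPU quantity such as '1' or '200m' to millicores."""
    if quantity.endswith("m"):
        return float(quantity[:-1])
    return float(quantity) * 1000.0


def memory_mib(quantity: str) -> float:
    """Convert a memory quantity such as '2Gi' or '500Mi' to MiB."""
    if quantity.endswith("Gi"):
        return float(quantity[:-2]) * 1024.0
    if quantity.endswith("Mi"):
        return float(quantity[:-2])
    raise ValueError(f"unsupported memory unit in {quantity!r}")


def request_utilization(requested: float, used: float) -> float:
    """Fraction of the declared request that is actually consumed."""
    if requested <= 0:
        raise ValueError("requested amount must be positive")
    return used / requested


# Declared requests versus the upper bound on observed usage from the text.
cpu_util = request_utilization(cpu_millicores("1"), cpu_millicores("200m"))
mem_util = request_utilization(memory_mib("2Gi"), memory_mib("500Mi"))

print(f"CPU request utilization:    {cpu_util:.0%}")
print(f"Memory request utilization: {mem_util:.1%}")
```

At most 20% of the CPU request and about 24% of the memory request is in use. Because schedulers and node autoscalers place pods by their requests, an autoscaler working from these declarations provisions roughly four to five times the capacity the service needs, whatever algorithm it runs.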