Capacity planning & predictive autoscaling

GOOD PRACTICE

TRAJECTORY: ↑ Advancing

AI that forecasts resource demand and automatically scales infrastructure ahead of load, rather than reactively. Includes predictive scaling based on traffic patterns and business events; distinct from reactive autoscaling which responds to current metrics only.

OVERVIEW

Predictive autoscaling is a mature infrastructure practice, generally available from every major cloud provider and deeply integrated into the Kubernetes ecosystem. Rather than reacting to CPU or memory spikes, it forecasts demand from historical patterns and provisions capacity ahead of load. CNCF reports 74% enterprise adoption; documented deployments show 22-70% cost reductions and consistent 99.99% availability. Market analysis projects the capacity management market at $6.47B by 2030 (24.4% CAGR), driven by AI-powered predictive analytics.

Vendor innovation continued in May 2026: Google Cloud released intent-based autoscaling for GKE with 5x faster reaction time (25s→5s) and native custom metrics that remove external monitoring dependencies; AWS confirmed ongoing platform investment; Cast AI and Zesty expanded coordinated HPA/VPA optimization at production scale with named customer results (40% cluster optimization, 65% infrastructure reduction).

The practice has cleared the "does it work" threshold; the question now is deployment reliability across layered systems. Capacity limits are the primary operational bottleneck for AI scaling: Datadog analysis of thousands of production systems attributes 60% of AI request failures directly to capacity constraints. Simple single-service scenarios remain reliable and straightforward. Multi-tier architectures expose harder problems: scaling the wrong bottleneck, thrashing from misconfigured thresholds, forecasts that are blind to business events, and GPU cold-start delays (30-120s) that render reactive HPA inadequate for LLM serving. Accurate prediction demands stable traffic history; anomalous spikes and unforecastable time series still require reactive fallbacks, and forecastability testing is a prerequisite. The practice is table stakes, but operational discipline separates teams that capture value from those that create new failure modes.
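The forecast-then-provision idea, together with the reactive fallback, can be illustrated with a minimal Python sketch. It assumes a seasonal-naive forecast (reuse the load seen at the same point in the previous cycle), which is a deliberately simple stand-in and not the model any vendor named here uses. The function names, the `per_replica_capacity` parameter, and the sample numbers are all hypothetical.

```python
import math
from typing import Sequence


def seasonal_naive_forecast(history: Sequence[float], season: int, steps_ahead: int) -> float:
    """Predict the load `steps_ahead` samples from now by reusing the value
    observed at the same point in the previous season."""
    if not 1 <= steps_ahead <= season:
        raise ValueError("steps_ahead must be between 1 and season")
    if len(history) < season:
        raise ValueError("need at least one full season of history")
    # history[-1] is "now"; step back one season from the target time.
    return history[len(history) - 1 + steps_ahead - season]


def replicas_for(load: float, per_replica_capacity: float) -> int:
    """Smallest replica count that can carry `load` (never below one)."""
    return max(1, math.ceil(load / per_replica_capacity))


def desired_replicas(history: Sequence[float], season: int,
                     lead_steps: int, per_replica_capacity: float) -> int:
    """Provision for the worst forecast inside the lead window, but never
    below what the current load needs (the reactive fallback)."""
    predicted_peak = max(
        seasonal_naive_forecast(history, season, k)
        for k in range(1, lead_steps + 1)
    )
    predictive = replicas_for(predicted_peak, per_replica_capacity)
    reactive = replicas_for(history[-1], per_replica_capacity)
    return max(predictive, reactive)


# Hypothetical load samples, six samples per cycle.
history = [10, 10, 50, 90, 50, 10, 10, 12]
print(desired_replicas(history, season=6, lead_steps=2, per_replica_capacity=20))  # 5
```

In this sketch `lead_steps` should cover at least the time a new replica takes to become ready; otherwise the forecast arrives too late to help. Taking the maximum of the predictive and reactive counts means an unforecast spike still triggers a scale-up.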

CURRENT LANDSCAPE

Vendor maturity and AI-specific acceleration (May 2026). Google Cloud launched intent-based autoscaling for GKE, with native custom metrics that remove the need for an external monitoring stack and 5x faster reaction time (25s→5s). AWS maintains predictive scaling across EC2, ECS, and Auto Scaling groups, with API-level documentation confirming product status as of April 2026. Azure continues to integrate KEDA into AKS, though VMSS operational reliability gaps persist in production. The cloud-native ecosystem has consolidated around KEDA (CNCF Graduated) for event-driven and GPU inference workloads, with production deployments at Alibaba, Grab, Calendly, Blizzard, and Grafana. Cast AI and Thoras now offer ML-powered predictive workload scaling as GA products. Specialized vendors (StormForge, Baseten, Kedify) focus on optimizing predictive autoscaling for AI inference, where GPU cold-start delays (5-8 minutes for 70B models) are the distinct challenge.

AI-specific capacity constraints drive adoption. Industry-wide utilization analysis (Cast AI, April 2026) across tens of thousands of Kubernetes clusters reveals structural underutilization (GPU 5%, CPU 8%, memory 20%), indicating that predictive autoscaling capability exists but deployment challenges prevent efficient use. Datadog analysis of production AI systems identifies capacity limits as the primary bottleneck: 60% of AI request failures are directly caused by capacity constraints, which makes this practice essential infrastructure for AI scaling. Practitioner benchmarks show concrete wins: KEDA queue-depth scaling for vLLM achieved a 40% GPU spend reduction and a 60% p99 latency improvement; Simplismart.ai used warm pools to cut inference scale-up to 60-70 seconds from 5-6 minutes; Grab deployed ML predictive autoscaling for Kafka consumers, reducing cost by 55% while raising utilization from 15% to 57%.
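Queue-depth scaling of the kind cited in the vLLM benchmark comes down to simple arithmetic: divide the number of waiting requests by the backlog each replica is expected to absorb, then clamp the result. The Python sketch below mirrors the average-value calculation used by the Kubernetes Horizontal Pod Autoscaler, but leaves out its tolerance band and stabilization window. The function name, parameters, and limits are hypothetical and not taken from the benchmark.

```python
import math


def replicas_from_queue_depth(queue_depth: int, target_per_replica: int,
                              min_replicas: int = 0, max_replicas: int = 8) -> int:
    """Replica count needed so each replica holds at most
    `target_per_replica` queued requests, clamped to the allowed range."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    if queue_depth < 0:
        raise ValueError("queue_depth cannot be negative")
    wanted = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


print(replicas_from_queue_depth(25, target_per_replica=10))    # 3
print(replicas_from_queue_depth(0, target_per_replica=10))     # 0 (scale to zero)
print(replicas_from_queue_depth(1000, target_per_replica=10))  # 8 (capped)
```

Scaling on backlog rather than CPU matters for inference because a GPU-bound server can have a long queue while its CPU looks idle.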

Deployment complexity and operational prerequisites remain barriers. Sedai CTO analysis documents the timing lag of reactive autoscaling (2-4 minutes) and the production cost of feedback-loop engineering. Practitioner assessments flag a critical structural barrier: many teams build forecasting models on inherently unforecastable time series, so diagnostic forecastability testing is now recommended as a pre-deployment prerequisite to avoid failed optimization investments. GPU cold-start mechanics are now well documented: container image pull (4-6 minutes) dominates, while weight loading and CUDA graph capture are secondary, which is why predictive pre-scaling is treated as the minimum viable approach for LLM serving. Market research projects capacity management reaching $6.47B by 2030 (24.4% CAGR), driven by AI-powered predictive analytics and automation. ROI is strong where deployment is straightforward (70% cost reductions and 99.99% availability in documented case studies), but operational discipline is required to avoid wrong-bottleneck scaling, threshold oscillation, and over-provisioning in multi-tier orchestration.
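The source does not say which forecastability diagnostic practitioners recommend, so the Python sketch below is only one plausible check under stated assumptions. It compares the error of a seasonal-naive forecast with the error of always predicting the series mean. A ratio well below 1 suggests a repeating pattern worth modeling; a ratio near or above 1 suggests the chosen season explains little. The function name and sample series are hypothetical.

```python
from statistics import mean
from typing import Sequence


def forecastability_ratio(series: Sequence[float], season: int) -> float:
    """Mean absolute error of a seasonal-naive forecast divided by the
    error of predicting the overall mean. Lower means more forecastable."""
    if season < 1 or len(series) < 2 * season:
        raise ValueError("need at least two full seasons of data")
    points = range(season, len(series))
    seasonal_err = mean(abs(series[i] - series[i - season]) for i in points)
    overall = mean(series)  # in-sample mean, acceptable for a rough diagnostic
    baseline_err = mean(abs(series[i] - overall) for i in points)
    if baseline_err == 0:
        return 0.0  # a constant series is trivially predictable
    return seasonal_err / baseline_err


periodic = [1, 5, 1, 5, 1, 5, 1, 5]
erratic = [1, 9, 4, 2, 8, 3, 7, 1]
print(forecastability_ratio(periodic, season=2))       # 0.0
print(forecastability_ratio(erratic, season=4) > 1.0)  # True
```

A team could run a check like this on each candidate metric before building a model, and keep reactive scaling for series that fail it.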

TIER HISTORY

Research: Jan-2018 → Jan-2018

Bleeding Edge: Jan-2018 → Jul-2022

Leading Edge: Jul-2022 → Oct-2024

Good Practice: Oct-2024 → present

EVIDENCE (118)

In-Depth Examination of Segments, Industry Developments, and Key Players in the Capacity Management Market (Industry Reports, 2026-05-14)

— Market research: capacity management market projected $6.47B by 2030 (24.4% CAGR), driven by AI-powered predictive analytics, cloud deployment, and automation adoption; signals widespread enterprise adoption.

Autoscaling (Product Launches, 2026-05-13)

— Baseten AI inference platform autoscaling uses concurrency-target with asymmetric scale-up/down behavior; demonstrates modern AI workload autoscaling practice including scale-to-zero.

KEDA, Karpenter, and Event-Driven Scaling in 2026 (Opinion, 2026-05-12)

— Practitioner benchmark: KEDA queue-depth scaling for vLLM achieved 40% GPU spend reduction and 60% p99 latency improvement by scaling on inference queue depth instead of CPU metrics.

Workload Autoscaler configuration - Getting started - Cast AI (Product Launches, 2026-05-12)

— Cast AI ML-powered predictive workload scaling forecasts future resource needs from historical patterns, moving beyond reactive scaling; represents ecosystem adoption of ML-based predictive autoscaling.

Predictive auto-scaling in Kubernetes using ARIMA time series ... (Research Papers, 2026-05-12)

— Tampere University master's thesis: ARIMA predictive autoscaling forecasts CPU utilization 45s ahead, successfully scaled 1→8 replicas during demand spike, validating computational lightness of classical forecasting.

Model autoscaling and hibernation - Getting started - Cast AI (Product Launches, 2026-05-12)

— Cast AI AI Enabler for vLLM autoscaling: replica-based scaling, intelligent hibernation for zero-cost idle periods, SaaS fallback routing; addresses AI-specific capacity management challenges.

Mastering Kubernetes Autoscaling for AI and Real-Time Traffic - Kedify (Conference Talks, 2026-05-12)

— Kedify maintainer at DevOpsCon 2026: practical KEDA strategies for AI/LLM workload autoscaling and real-time traffic handling; reflects emergence of AI-specific autoscaling as distinct practitioner challenge.

How We're Closing the Gaps in Your Cloud Budget (Adoption Metrics, 2026-05-11)

— Sedai customer outcomes: typical 30%+ cost reduction through application-aware intelligent autoscaling; adoption metric showing commercial viability of predictive capacity optimization platforms.

HISTORY

2018: AWS launches Predictive Scaling for EC2 in general availability, introducing machine learning-based capacity forecasting to mainstream cloud infrastructure. Major vendors begin shipping predictive autoscaling as a core platform capability.
2019: Predictive autoscaling enters production at scale: AWS validates the approach during Prime Day, scaling by a very large number of server equivalents. The ecosystem matures, with Citrix shipping VDI autoscaling. However, production failures emerge as cloud provider prediction logic fails under edge cases (Azure scale-in blocked by false memory-spike predictions; Kubernetes spot instance capacity conflicts). Academic research finds that existing self-aware autoscaling systems cannot yet be deployed reliably.
2020: Broad adoption accelerates across cloud and container orchestration platforms. Google Cloud releases scale-in controls; AWS demonstrates ML-based predictive scaling in game services (GameServer Autopilot using SageMaker RL); independent financial services (Monzo) publish Kubernetes autoscaling case studies. Research continues advancing RL approaches. Operational challenges remain widespread: CodeDeploy integration failures cause infinite scale-in/out loops; parameter tuning issues lead to erratic scaling; predictive models fail on traffic anomalies.
2021: Predictive autoscaling transitions from specialized feature to mainstream platform capability. AWS moves predictive scaling into native EC2 Auto Scaling policy (May 2021), improving accessibility; extends support to custom application metrics by November 2021. Google Cloud launches predictive autoscaling for Compute Engine in preview (March 2021), validating vendor convergence around the approach. Research papers on OpenStack and neural network approaches advance the algorithmic foundations. Operational maturity gaps persist despite wider adoption.
2022-H1: AWS releases predictive scaling backfill feature (May 2022), enabling retroactive forecast validation. Kubernetes ecosystem matures with KEDA integration guides from AWS and practitioner deployments demonstrating latency reduction in production. However, Azure VMSS scale failures in early 2022 reaffirm that integration reliability remains inconsistent across vendors. Adoption accelerates but operational challenges persist.
2022-H2: Cloud provider expansion continues with AWS rolling out predictive scaling to Jakarta (October), while Azure prepares GA of native predictive autoscaling. Kubernetes ecosystem deepens with specialized tooling—Avesha launches Smart Scaler (October), a vendor-specific HPA product using RL. Academic research advances with graph neural networks and energy-efficient multi-resource prediction frameworks being validated. Ecosystem shows broad adoption with persistent reliability gaps.
2023-H1: Predictive autoscaling becomes normalized as an expected platform feature across all major providers. AWS improves EC2 forecast frequency from daily to 4x daily (January 2023), reducing forecast windows from 24h to 6h for better responsiveness. AWS adds console-based activation recommendations to reduce adoption friction. Azure VMSS native predictive autoscaling reaches general availability with 7-day minimum training data requirement. Alibaba Cloud publishes VLDB 2023 research on hyperscale predictive autoscaling deployment (MagicScaler), validating enterprise production viability. Kubernetes practitioners demonstrate GPU and ML workload optimization with KEDA. Adoption has moved from "competitive advantage" to "table-stakes infrastructure"; operational reliability remains the primary constraint.
2023-H2: Platform vendors continue ecosystem expansion: Microsoft ships KEDA as native add-on for Azure Kubernetes Service (November), streamlining event-driven and predictive scaling integration. AWS enhances autoscaling reliability with instance refresh rollback controls via CloudWatch alarms (August), enabling proactive failure detection. However, operational challenges persist: Pivotal Cloud Foundry experiences metric accuracy failures causing autoscaling thrashes; Azure VMSS suffers regression in ephemeral OS disk handling affecting autoscaling; AWS Aurora encounters 'anticipated flapping' algorithm limitations blocking scale-in. Academic research validates predictive autoscaling feasibility with 99% performance improvements. Landscape reflects matured platforms with well-documented reliability gaps rather than capability gaps.
2024-Q1: Predictive autoscaling enters steady-state maturity as expected platform capability across cloud and Kubernetes. AWS optimizes Windows workloads (78% scale-out time reduction via EC2 Image Builder), Azure continues with DaaS integration (Citrix Autoscale Insights preview), and academic research advances self-adaptive microservice approaches (50% CPU savings vs HPA, GRU-based VNF prediction at 98% accuracy). Platform fragmentation shifts focus from feature parity to operational reliability—vendors offer similar core capabilities but diverge significantly on failure recovery, parameter tuning complexity, and behavior during traffic anomalies. Adoption is universal; the practice is now table-stakes infrastructure rather than competitive advantage.
2024-Q2: Cloud platforms complete Kubernetes integration: Azure KEDA reaches GA in Portal (May 2024) and native AKS, with AWS/Microsoft publishing vendor-specific tutorials demonstrating ecosystem maturity. AWS Well-Architected Framework officially endorses predictive scaling (June 2024), cementing adoption as table-stakes. Academic innovation continues: BIAS Autoscaler achieves 25% cost reduction via burstable instances, advancing algorithm approaches. However, production reliability gaps surface in canary deployments: Argo Rollouts documents 30-second service disruption windows during dynamic scaling (June 2024), while Azure VMSS continues experiencing multi-hour scaling delays. The practice remains mature and widely adopted, but real-world deployment challenges in complex orchestration scenarios reveal the operational maturity gap between simple and sophisticated use cases.
2024-Q3: Market reports put predictive autoscaling growth at 13.2% CAGR, with industry adoption valued at $407B and growing. Academic research identifies novel failure modes: the PREFACE framework (FSE 2024) reveals that autoscaling introduces previously undetected failure patterns in distributed applications, requiring specialized prediction techniques. A comprehensive review in the Sensors journal underscores persisting challenges in ML-based forecasting. Practitioner guidance increasingly acknowledges over-engineering risks: real-world deployments show failures from database DoS caused by uncoordinated scaling, slow boot times, and threshold misconfiguration. Azure VMSS operational reliability remains inconsistent: official troubleshooting guidance documents flapping thresholds, diagnostic extension failures, and Flex VM scale-set delays. On balance, platform maturity and universal adoption are uncontested, but deployment specialists emphasize that operational complexity grows with orchestration sophistication; simple use cases remain reliable while multi-layered scenarios reveal brittle failure modes.
2024-Q4: Platform maturity solidifies: AWS extends predictive scaling to ECS (November 2024), MongoDB demonstrates production research on Atlas vertical scaling, vendor ecosystem remains robust with Alibaba/Azure/Grafana production deployments via KEDA. However, operational reality persists—KEDA discussions surface multi-minute scaling delays, revealing that even mature platforms encounter latency challenges at scale.
2025-Q1: Platform investment continues across vendors: Microsoft publishes updated KEDA integration guides for Azure AKS (March 2025); AWS and practitioners publish multi-service optimization tutorials; community discussion highlights fundamental limitations of predefined metrics (lagging indicators, over-simplification), reflecting ecosystem maturity focused on operational refinement rather than capability expansion. Adoption remains universal and table-stakes; focus shifts entirely to reliable deployment patterns in complex scenarios.
2025-Q2: Vendor expansion into specialized workloads: Oracle Cloud launches GA custom metrics autoscaling for AI model deployments (April 2025); KServe publishes KEDA integration tutorials for inference service scaling (May 2025). Academic validation: Aalborg University research demonstrates predictive autoscaler reducing response time 14-20% and high-latency requests 93-95% vs reactive HPA (June 2025). Critical practitioner assessments surface persistent limitations: complexity across layered systems, reactivity windows despite forecasting, and cost risks from over-provisioning. Ecosystem remains mature and universally adopted; operational reliability in sophisticated scenarios continues as primary constraint.
2025-Q3: Vendor ecosystem consolidation and validation: Citrix announces VDI predictive autoscaling analysis tooling in GA (September 2025), expanding vendor breadth beyond cloud-native. Practitioner case studies show strong ROI: AWS deployments achieve 70% cost reduction and 99.99% availability using predictive policies (August 2025). ML/AI workload orchestration deepens: technical guides demonstrate KEDA-based autoscaling for GPU-intensive inference with cost optimization (August 2025). Critical assessment identifies persistent failure modes: reactive latency despite forecasting, wrong-target scaling when bottlenecks are downstream, thrashing from misconfiguration, and business-context blindness. Platform maturity is uncontested; deployment complexity in multi-layer scenarios remains primary operational constraint.
2025-Q4: Capability expansion beyond traditional compute: Grab deploys ML predictive autoscaling for Flink stream processing (October 2025), addressing 2.5x app growth through CPU forecasting; AWS expands predictive scaling to 6 new regions signaling continued investment; CNCF analysis highlights persistent performance/reliability/cost trade-offs in Kubernetes autoscaling with KEDA and Karpenter; production maturity guides document stable HPA adoption patterns. Ecosystem remains mature and universal; vendor and practitioner focus shifts entirely to reliable deployment in layered systems and specialized workload categories.
2026-Jan: Operational reliability challenges persist despite ecosystem maturity: Azure Kubernetes Service autoscaling failures documented in production, with node pool scaling stuck due to quota limits, capacity constraints, or subnet IP exhaustion. Platform foundation remains stable for simple use cases; complex multi-layered orchestration continues to require careful failure recovery planning and manual intervention fallbacks.
2026-Feb: Ecosystem maturity confirmed with production evidence and novel research directions: Calendly demonstrates predictive HPA deployment using Datadog time-shifted metrics eliminating hourly traffic spike latency; Aalborg University research validates 34.68% energy reduction via NeuroScaler; CNCF reports 74% enterprise adoption with Q1 2026 fintech case study showing 22% cost reduction via Karpenter. Google Cloud and practitioners publish deployment guides. Critical assessment notes persistent lack of public case studies despite ecosystem technical maturity, highlighting documentation gap. Platform capability remains table-stakes; deployment complexity in multi-layered scenarios remains primary constraint.
2026-Apr: Evidence confirmed predictive autoscaling's expanding role in AI inference workloads while documenting persistent barriers specific to that domain. Simplismart.ai achieved 60-70 second GPU inference scale-up using EC2 warm pools, down from 5-6 minutes previously, validating the warm pool strategy for AI workloads. NVIDIA's Dynamo SLA Planner extended ML-based capacity forecasting (ARIMA, Kalman filter, Prophet) to GPU infrastructure, and Amazon published ACM SoCC research on ensemble forecasting algorithms for Redshift Serverless. Blizzard published an operational Kubernetes autoscaling playbook for predictable game-launch spikes. A KubeCon EU practitioner analysis identified structural adoption barriers for AI inference: GPU cold-start delays of 30-120 seconds make reactive HPA inadequate, and token latency variation undermines threshold-based scaling—reinforcing that predictive pre-scaling is now the minimum viable approach for LLM serving infrastructure. Enterprise production deployments show consistent ROI: MongoDB's experiment on 10K production clusters validated cost savings at 9 cents/hour per replica set; Grab achieved 55% infrastructure cost reduction via KEDA on Kafka consumers (CPU utilization 15%→57% while maintaining SLA and zero data loss); Uber operates Capacity Recommendation Engine for thousands of microservices. KEDA reached CNCF Graduated status (August 2023), cementing event-driven autoscaling as an industry-standard capability endorsed by vendor-neutral governance. A critical practitioner assessment identified a fundamental modeling risk: many teams build forecasting models on inherently unforecastable time series, with diagnostic techniques to test forecastability before modeling now recommended as a pre-deployment prerequisite.
2026-May: Ecosystem maturation confirmed with continued vendor investment and industry-wide capacity constraints emerging as primary bottleneck. Google Cloud (April 2026) launched intent-based autoscaling for GKE with 5x faster reaction time (25s→5s) and native custom metrics, eliminating external observability stack dependencies. AWS confirmed continued platform investment; Cast AI and Zesty expanded coordinated HPA/VPA optimization — Zesty platform GA demonstrates 40% cluster optimization and 10% size reduction at production scale. KEDA queue-depth scaling for vLLM inference achieved 40% GPU spend reduction and 60% p99 latency improvement, validating event-driven scaling over CPU-metric approaches for AI workloads. Critical signal: Datadog production analysis established capacity limits as the primary operational bottleneck — 60% of AI request failures directly caused by capacity constraints across thousands of production systems. Market analysis projects capacity management reaching $6.47B by 2030 (24.4% CAGR) driven by AI-powered predictive analytics. Practitioner assessments emphasize that simple deployments remain reliable while multi-layered scenarios require operational discipline; feedback-loop engineering costs and forecastability testing prerequisites remain significant implementation barriers.