Time series forecasting

LEADING EDGE

TRAJECTORY: Stalled

AI models that forecast future values from historical time series data across demand, revenue, usage, and other metrics. Includes deep learning forecasting and automated model selection; distinct from financial forecasting, which applies time series methods to a specific finance context.

OVERVIEW

AI-driven time series forecasting has reached the point where forward-leaning organisations extract real value from it -- but most have not yet started, and the field's central question remains unresolved. Neural and foundation model approaches (Transformers, TimeGPT, TimesFM) promise zero-shot generality across demand, revenue, and operational metrics, yet empirical evidence stubbornly shows that simpler methods -- gradient boosting, ARIMA, exponential smoothing -- match or beat them on most production workloads. The M4 Competition, repeated benchmarking studies, and practitioner case studies all converge on the same finding: model performance is task-dependent, not architecture-dependent. What makes this a leading-edge practice is not proof that deep learning wins, but that a mature vendor ecosystem, cloud-managed services, and confirmed multi-sector deployments have made automated forecasting accessible at scale. The tension that defines this tier is method selection: organisations can deploy forecasting today, but choosing when neural complexity justifies its cost over classical alternatives still requires domain expertise and empirical validation rather than default architectural commitment.
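The kind of empirical validation described above can be sketched as a rolling-origin backtest that scores a candidate model against a naive baseline. This is a minimal illustration, not a method taken from any cited study: the synthetic weekly series, the function names, and the parameters (`season`, `alpha`, `min_train`, `step`) are all assumptions made for the example.

```python
import numpy as np

def seasonal_naive(history, horizon, season=7):
    """Baseline: repeat the last full season of observations."""
    last = history[-season:]
    reps = int(np.ceil(horizon / season))
    return np.tile(last, reps)[:horizon]

def simple_exp_smoothing(history, horizon, alpha=0.3):
    """Candidate: flat forecast at the final exponentially smoothed level."""
    level = history[0]
    for y in history[1:]:
        level = alpha * y + (1 - alpha) * level
    return np.full(horizon, level)

def rolling_origin_mae(series, forecaster, horizon=7, min_train=56, step=7):
    """Average MAE over successive forecast origins (no look-ahead)."""
    errors = []
    for origin in range(min_train, len(series) - horizon + 1, step):
        train = series[:origin]
        test = series[origin:origin + horizon]
        errors.append(np.mean(np.abs(forecaster(train, horizon) - test)))
    return float(np.mean(errors))

# Hypothetical data: strong weekly seasonality plus small noise.
rng = np.random.default_rng(0)
t = np.arange(140)
series = 100 + 10 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 1, t.size)

mae_naive = rolling_origin_mae(series, seasonal_naive)
mae_ses = rolling_origin_mae(series, simple_exp_smoothing)

# Skill relative to the baseline: positive means the candidate beats it.
skill = 1 - mae_ses / mae_naive
print(f"seasonal naive MAE={mae_naive:.2f}, SES MAE={mae_ses:.2f}, skill={skill:.2f}")
```

On this particular series the seasonal baseline wins because the candidate ignores seasonality. A differently shaped series could reverse the ranking, which is why the comparison has to be rerun per workload rather than settled by architecture.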

CURRENT LANDSCAPE

The vendor ecosystem is consolidating around foundation models even as evidence mounts against their universal superiority. AWS completed its deprecation of Amazon Forecast, retreating from specialised forecasting-as-a-service -- a significant signal from the category's largest cloud provider. Foundation model vendors filled the gap: Google released TimesFM 2.5 (March 2026) with 200M parameters and 16k context length (8x expansion), integrated into BigQuery ML and Google Sheets for consumer-grade accessibility; Amazon Chronos-2 achieved 600M+ Hugging Face downloads and added multivariate/covariate support; Salesforce released Moirai-MoE with sparse mixture-of-experts outperforming larger rivals at 28x parameter efficiency. Enterprise adoption is real but narrow -- 62% of enterprises report increased predictive analytics demand, and documented deployments span retail (The Very Group: 9.9% SKU management improvement across 8M+ forecasts), manufacturing (Foxconn: 8% accuracy gain, $553K annual savings), energy (renewable forecasting achieving 14% balancing cost reduction), and healthcare (peer-reviewed mortality/discharge prediction). Yet peer-reviewed research keeps undermining the case for model complexity: a billion-scale benchmark (QuitoBench) on Alipay data reveals that deep learning matches foundation models with 59x fewer parameters at short context lengths, while transformer-based models underperform simple linear models on financial data due to variance-driven error. Theoretical work has formalised non-zero error bounds tied to partial observability, while calibration studies confirm TSFMs maintain reliable uncertainty estimates, which enables deployment in high-stakes domains but does not resolve the core efficiency tension.
Perhaps the most consequential finding is that traditional accuracy metrics (MAPE, MAE) correlate poorly with economic outcomes -- optimising for forecast precision can actually reduce profitability by ignoring pricing, substitution, and agency effects. The field's real barrier remains not which model to choose but whether forecasting teams are optimising for the right objective and whether zero-shot generic foundation models offer genuine ROI over domain-specific fine-tuning.
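A small worked example shows how an accuracy metric can be blind to economic outcome. It is a sketch under stated assumptions, not a reproduction of the cited finding: the demand figures, the `price` and `unit_cost` values, and the simple order-up-to-forecast profit rule are all hypothetical, and only asymmetric costs are modelled, not substitution or agency effects.

```python
import numpy as np

def mae(forecast, actual):
    """Mean absolute error of a forecast."""
    return float(np.mean(np.abs(forecast - actual)))

def profit(order_qty, demand, price=10.0, unit_cost=2.0):
    """Profit when the forecast is used directly as the order quantity.

    Units beyond demand are unsold (cost only); demand beyond the order is lost.
    """
    sold = np.minimum(order_qty, demand)
    return float(np.sum(price * sold - unit_cost * order_qty))

# Hypothetical demand over four periods.
demand = np.array([100.0, 120.0, 90.0, 110.0])

# Two forecasts with identical absolute errors but opposite bias.
over = demand + 10   # always 10 units too high
under = demand - 10  # always 10 units too low

print("MAE over :", mae(over, demand))      # 10.0
print("MAE under:", mae(under, demand))     # 10.0
print("Profit over :", profit(over, demand))    # 3280.0
print("Profit under:", profit(under, demand))   # 3040.0
```

Both forecasts score the same MAE, yet the over-forecast earns more here because a lost sale costs more than an unsold unit under the assumed margin. With a different cost structure the ranking flips, while the accuracy metric stays the same.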

TIER HISTORY

Research: Jan-2017 → Jan-2017

Bleeding Edge: Jan-2017 → Jan-2019

Leading Edge: Jan-2019 → present

EVIDENCE (143)

Why Model Selection Fails in Time Series Forecasting: An Empirical Study of Instability Across Data Regimes | Research Papers | 2026-05-02

— Negative signal: empirical evidence of fundamental rule-based model selection failures, documenting context-dependent performance instability and practical maturity challenges.

Explainable Load Forecasting with Covariate-Informed Time Series Foundation Models | Research Papers | 2026-04-30

— Production TSO deployment: empirical evaluation of TSFMs (Chronos-2, TabPFN-TS) on energy load forecasting, zero-shot competitive with task-specific models.

Engineering Uncertainty Estimation in Neural Networks for Time Series Prediction at Uber | Case Studies | 2026-04-28

— Uber production deployment: Bayesian neural networks for demand and anomaly detection with principled uncertainty decomposition (epistemic, aleatoric, distributional shift) at scale.

Scientists Develop Algorithm for Accurate Financial Time Series Forecasting | Research Papers | 2026-04-28

— Peer-reviewed empirical study benchmarking 200,000+ model configurations with wavelet-based methodology improvements and quantified financial forecasting results.

[Paper Review] FETS Benchmark: Foundation Models Outperform Dataset-specific Machine Learning in Energy Time Series Forecasting | Research Papers | 2026-04-27

— Comprehensive energy sector benchmark: TSFMs outperform dataset-specific ML on 54 energy datasets (9 categories), deployment-grade rigor in critical infrastructure domain.

WHY DEMAND FORECASTING FAILS WHEN MARKETS STRUCTURALLY SHIFT | Opinion | 2026-04-26

— Critical assessment documenting failure modes during market regime changes; identifies structural barriers to forecasting effectiveness in real production environments.

TimeRecipe: A Time-Series Forecasting Recipe via Benchmarking Module Level Effectiveness | Research Papers | 2026-04-23

— ICLR 2026 benchmarking framework: 10,000+ experiments evaluating forecasting architecture components; 92% configurations beat SOTA with 5.4% error reduction through systematic design exploration.

Time Series Forecasting with Google's TimesFM 2.5 in MetaTrader 5 | Case Studies | 2026-04-20

— Financial sector production deployment: end-to-end TimesFM 2.5 pipeline with fine-tuning on 14 instruments, demonstrating real-world adoption in algorithmic trading with technical implementation evidence.

HISTORY

2017: DeepAR paper introduced autoregressive deep learning forecasting with 15% accuracy gains; Facebook released Prophet for operational time series forecasting; early neural network methods validated on benchmarks but production adoption faced practical challenges with seasonality and robustness.
2018: AWS launched Amazon Forecast as fully managed deep learning service; Amazon disclosed production deployment of time series forecasting across retail supply chain for demand anticipation; peer-reviewed study on 1,045 time series showed classical methods outperforming ML approaches, raising questions about ML adoption viability; Prophet usage revealed practical limitations including accuracy failures on hourly data and sensitivity to changepoints, creating divergence between vendor cloud offerings and practitioner tool effectiveness.
2019: Amazon Forecast reached GA (August) with advanced features (quantile selection); Microsoft deployed curriculum learning LSTMs for financial forecasting with 30% accuracy gains; Prophet adoption expanded into CPG supply chain (1,000+ route forecasting); NeurIPS research advanced Transformer architectures for time series; transport network studies confirmed simpler methods often outperform deep learning for short-term prediction, reinforcing fragmentation between theoretical advances and practical method-selection guidance.
2020: Amazon Forecast expanded ecosystem integration (Anaplan) and deployed advanced metrics; Amazon Redshift achieved 70% improvement in node prediction; Hyperconnect and other companies confirmed Prophet's operational viability; deep learning surveys and domain-specific research advanced neural architectures; M4 Competition decisive results showed pure ML significantly underperformed classical and hybrid methods on 100,000 time series, intensifying core tension between vendor momentum and evidence-based method limitations.
2021: AWS Forecast expanded retail deployments (achieving 10-20% accuracy improvements, 2-3% revenue gains on 985-store networks) and SME accessibility via cloud economics; Nixtla introduced TimeGPT-1 foundation model trained on 100B+ data points, signaling evolution toward pre-trained, zero-shot forecasting; academic research intensified with comprehensive surveys documenting deep learning maturity (RNNs, LSTMs, CNNs, Transformers); however, critical 2021 research directly challenged deep learning necessity, showing gradient boosting regression trees outperforming state-of-the-art DL models on benchmark datasets, further reinforcing the core tension that vendor momentum and academic research advancement had not resolved the fundamental question of when deep learning adds genuine value versus simpler alternatives.
2022-H1: Vendor ecosystem deepened with Snowflake-AWS joint offering for CPG demand forecasting and Amazon Connect integration of forecasting for contact center staffing; methodological research matured with peer-reviewed papers addressing hybrid approaches (LSTM-Prophet for energy) and forecast evaluation rigor, bridging ML and statistical best practices; AWS expanded SageMaker tooling with accessible tutorials combining Prophet, LSTNet, and DeepAR, reinforcing operational consolidation and accessibility despite ongoing core tension between neural advances and simpler method effectiveness.
2022-H2: Enterprise-scale deployments demonstrated maturity: Bosch deployed hierarchical revenue forecasting at million-time-series scale on Amazon Forecast with custom Transformers for COVID-19 volatility; AWS launched a what-if analysis feature (80% faster scenario testing); Nixtla maintained open-source momentum with StatsForecast and transparent neural-vs-statistical trade-off guidance. Critical empirical pushback followed: COVID-19 forecasting studies showed Holt-Winters significantly outperforming Prophet, reinforcing evidence that simpler methods are often superior to neural approaches on real-world data. Hybrid research (ARIMA-ANN, LSTM-Prophet) gained traction, cementing a pragmatic multi-method ecosystem.
2023-H1: Real-world deployments expanded: apparel manufacturer saved 1000+ monthly man-hours and achieved 1-2% bottom-line improvement with Amazon Forecast; medical device maker achieved 20% accuracy gains and 8% inventory reduction on 1000+ SKUs with DeepAR. AWS named customer growth continued (More Retail, The Very Group). Foundation model research emerged as new direction, with surveys documenting pre-trained models for cross-domain forecasting. Research identified data scarcity as critical adoption barrier limiting deep learning progress; comparative studies reinforced pragmatic method-selection tensions between deep learning accuracy and classical methods' robustness.
2023-H2: Vendor ecosystem integration deepened: AWS Redshift ML added Forecast integration for SQL-native forecasting; Lenovo deployed enterprise LeForecast platform combining foundation models, multimodal, and hybrid engines for demand and carbon emissions forecasting. E-commerce (bol) achieved 2-5% improvements with sparse hierarchical loss methods; ensemble methods gained traction as pragmatic solution to data heterogeneity and cold-start problems. Critical assessment surfaced: Lokad CEO and practitioners challenged traditional accuracy-focused forecasting metrics, arguing misalignment with business value and questioning ROI of complex methods. Foundation models (TimeGPT) continued evolution. Tension persisted: academic research and vendor innovations accelerated, but practical adoption barriers centered on problem formulation and business value alignment, not methodology.
2024-Q1: Foundation model momentum met skepticism: Nixtla TimeGPT achieved GA with Azure integration and named customers (Ford, Walmart, FedEx, Databricks); Amazon announced Forecast discontinuation for new customers, signaling platform consolidation. However, critical research and practitioner feedback revealed foundation model overfitting on benchmarks and failure on diverse real-world time series, with LLM-based approaches underperforming classical ARIMA; users reported transformer models vastly overfit with limited independent validation. Amazon's Chronos showed mixed results (some successes, documented failures on real data). The defining tension sharpened: vendor ecosystem accelerated foundation model investment while empirical evidence continued showing simpler methods and ensembles retain superiority on practical deployments, raising questions about adoption value versus marketing momentum.
2024-Q2: Vendor consolidation accelerated around foundation models: Microsoft Azure integrated Nixtla's TimeGEN-1 as MaaS with early customers (STIHL, Bridgestone); Google published TimesFM decoder-only model (200M parameters, 100B training data points, ICML 2024); AWS expanded Supply Chain tooling with Forecast Model Analyzer despite discontinuing core Forecast product. Foundation model race intensified with competitive differentiation and cloud vendor positioning, yet deployment reality remained pragmatic: ensemble methods dominated real-world ROI, and critical evidence of foundation model superiority on production data remained sparse. Academic research advanced (TimeCMA LLM integration, domain-specific architectures), and market adoption signals broadened (retail, finance, manufacturing, energy), but the core tension persisted: vendor marketing acceleration outpaced evidence of production value versus simpler alternatives.
2024-Q3: Foundation model momentum met decisive empirical resistance. Salesforce disclosed 70+ production forecasting use cases deliberately multi-model (ARIMA, Prophet, XGBoost, Moirai, TimesFM), avoiding foundation-model-only commitment. Independent research showed gradient boosting significantly outperforms foundation models (Chronos) in volatile domains while FMs excel only in stable trend contexts. Practitioner production case study on European telecom found Prophet's interpretability advantages offset by accuracy limitations and tuning requirements compared to tree-based methods. Vendor platforms continued cloud MaaS positioning (Microsoft, AWS, Nixtla), but evidence gap widened: marketing claims of universal foundation model superiority contradicted by deployment realities of pragmatic, multi-method ensembles and domain-specific model selection.
2024-Q4: Vendor consolidation deepened with both ecosystem expansion and strategic retreat. Nixtla released nixtlar SDK (R CRAN package v0.6.2) for TimeGPT ecosystem breadth; AWS enhanced Amazon Connect with minimal-data forecasting (single-interaction forecasting); foundation model research advanced with comparative studies (pre-trained LSTMs vs small-scale transformers) revealing nuanced strengths and limitations. However, critical empirical counter-evidence mounted: manufacturing domain benchmarking found simpler algorithms (XGBoost, XiBoost) consistently outperform complex SOTA architectures across real datasets, challenging fundamental assumptions about model sophistication. AWS discontinued Amazon Forecast for new customers mid-2024, pivoting to SageMaker Canvas—a major signal of reduced confidence in specialized forecasting service differentiation. Academic syntheses (comprehensive ML method survey) reinforced that algorithm selection must be task-specific, not defaulting to neural or foundation model approaches. The tension sharpened: vendor platforms claimed 1 billion series forecasting scale and promised simplification through foundation models, yet production deployments, platform deprecations, and empirical benchmarking all pointed toward pragmatic multi-method ensembles and skepticism about foundation model ROI.
2025-Q1: Foundation model vendor acceleration met intensifying research skepticism. Nixtla released Python SDK v0.7.3 and continued ecosystem expansion with Azure and Snowflake integration; AWS SageMaker Canvas matured with automated ensemble stacking for retail/CPG forecasting; industry adoption broadened across finance, retail, logistics, and healthcare, signaling market maturity. However, critical research directly challenged foundation model necessity: ICML 2025 workshop findings showed simple PCA+Linear models achieving competitive zero-shot results with SOTA TSFMs, questioning FM complexity and supporting skepticism about universal FM superiority. Practitioner assessment reinforced limitations, noting time-series forecasting remains "one of the last frontiers AI has yet to conquer" with persistent challenges in dynamic pattern and nuanced fluctuation prediction. The three-way divergence deepened: vendor marketing accelerated FM claims, research raised critical questions about FM necessity, and production deployments remained committed to pragmatic multi-method ensembles, signaling that ecosystem maturity coexists with fundamental unresolved tensions on method selection and value-add of deep learning and foundation models.
2025-Q2: Foundation model momentum met practical implementation challenges and empirical skepticism. NeurIPS 2025 research advanced core forecasting methodology (PIR framework for instance-aware bias revision), while vendor ecosystem expanded (TimeGPT Snowflake integration, CRAN nixtlar R package, AWS SageMaker Canvas ensemble automation). Practical benchmarking revealed nuanced foundation model trade-offs: Parseable study showed FMs excel at concurrent stream management but require domain-specific tuning; VN1 retail competition found TimeGPT 2nd (zero-shot) trailing finetuned MOIRAI, indicating zero-shot limitations and exogenous variable dependency. Open-source maturation accelerated with Time-Series-Library reaching 11.7k stars supporting advanced deep architectures (TimesNet, iTransformer). Real-world adoption evidence surfaced persistent barriers: energy deployments confirmed ensemble method superiority over ARIMA; eCommerce practitioners documented forecasting failures despite available tools, highlighting problem formulation and business alignment as primary adoption bottlenecks. The central tension persisted: vendor FM claims and integrations expanded while competition results, benchmarks, and practitioner experience all documented FM limitations requiring domain expertise and hybrid approaches, reinforcing that method selection complexity and business value misalignment remained the core barriers to advancement.
2025-Q3: Major vendor capitulation and empirical pushback marked critical inflection. AWS discontinued Amazon Forecast for new customers (September 2025), signaling strategic retreat from specialized forecasting infrastructure—a decisive consolidation signal despite years of successful production deployments. Foundation model vendors maintained ecosystem expansion (Nixtla SDK maturation, TimeGPT documentation, persistent integrations) yet independent peer-reviewed research from Copenhagen Business School (ITISE 2025) provided grounding: TimeGPT outperformed for weekly granularities but showed weakness for daily/monthly frequencies, directly contradicting universal FM superiority claims. Enterprise adoption remained broad (72% of firms across manufacturing/services deploy time-series forecasting, average MAPE 12.4%) with documented wins (retailers improving accuracy 27%→76% with 20% waste reduction) and persistent practitioner barriers (explainability, hierarchical forecasting complexity, feature engineering challenges in multivariate contexts). The defining tension sharpened: major cloud vendor retreat signaled reduced confidence in forecasting-as-a-service differentiation; FMs advanced claims while independent benchmarking and practice documented significant limitations, reinforcing pragmatic multi-method ensemble selection and highlighting that method complexity and business misalignment—not model architecture—remained primary adoption barriers.
2025-Q4: Discipline consolidated into pragmatic equilibrium with evidence-grounded skepticism of neural and foundation model necessity. AWS completed Forecast deprecation transition with no new customer support; simultaneously, foundation model vendors expanded ecosystem integration (TimeGPT on MindsDB enabling SQL-native forecasting, continued cross-platform availability). Critical peer-reviewed research established formal limits: journal study found zero statistically significant differences among seven competing models on transformer load forecasting with all models failing under extreme volatility; theoretical research identified fundamental non-zero error bounds due to partial observability and exponential complexity-performance relationship, formally grounding practitioner skepticism. Market adoption remained strong (62% of enterprises cite increased predictive analytics demand, 71% of data scientists adopt zero-shot/FM approaches) with growth projections (5-6% CAGR to $0.47B by 2033). However, practitioner assessment identified core business misalignment: traditional accuracy metrics (MAPE, MAE) correlate poorly with economic outcomes and can reduce profitability by ignoring pricing, substitution, and agency—revealing that optimization target misalignment rather than methodology complexity remained the primary adoption barrier. State at year-end: ecosystem matured with expanded tooling accessibility; deployments confirmed across finance, retail, logistics, healthcare; yet the defining tension on method selection and business value alignment remained unresolved despite years of intensive research and evidence uniformly supporting pragmatic multi-method ensembles over singular architectural commitment.
2026-Jan: Vendor innovation and deployment evidence continued despite platform consolidation. Amazon Forecast legacy customers reported sustained ROI: The Very Group improved SKU management 9.9% (£110M value across 8M+ forecasts); More Retail increased produce accuracy 27%→76% (20% waste reduction); Foxconn achieved 8% accuracy improvement (annual savings $553K). NVIDIA advanced methodology with scalable probabilistic TSF framework outperforming GenCast and Integrated Forecasting System without specialized architectural constraints. Peer-reviewed healthcare research validated TSF on mortality/discharge prediction. Foundation models maintained ecosystem momentum (TimeGPT documentation). Cutting-edge research (SEER arXiv preprint) addressed data quality robustness via transformer-based patch enhancement. The defining state: despite AWS platform deprecation, empirical deployment evidence and methodological research reinforced that pragmatic multi-method ensemble selection—driven by domain-specific requirements and business value alignment—remained superior to singular architectural commitments.
2026-Feb: Google advanced MLP-based architectures (TiDE) with 10.6% MSE improvements over transformers and 5-10x faster inference. Foundation model vendors continued ecosystem expansion (TimeGPT, TimesFM integration) while peer-reviewed research simultaneously advanced evaluation rigor (TIME benchmark with 50 fresh datasets) and highlighted persistent method-selection challenges (AHSIV framework for horizon-induced degradation). Empirical study on Australian electricity market documented continued performance degradation of SOTA models under volatility, reinforcing practitioner skepticism about model complexity necessity. Emerging research (agentic TSF position paper) proposed paradigm shift toward adaptive, context-aware forecasting. The defining state: methodological innovation accelerated (MLP efficiency, evaluation standards, agentic frameworks) while real-world evidence continued showing pragmatic multi-method ensembles and domain expertise remain essential—ecosystem momentum building on foundation model infrastructure does not yet translate to resolved tensions on method selection or business value alignment.
2026-Mar: Vendor retreat and foundation model maturity sharpened the leading-edge tension. AWS formally closed Amazon Forecast to new customers (March 9), confirming 2025 deprecation and marking major cloud provider pullback despite prior enterprise success (Very Group, Foxconn, More Retail documented savings). Simultaneously, Google released TimesFM (versions 1.0-3.0 on Hugging Face, 24.8k+ downloads) and Amazon published Chronos benchmarks, demonstrating continued vendor investment in foundation models. However, critical peer-reviewed research mounted evidence against TSFM universality: a taxonomy-aware evaluation critique (Saqur et al., 0.85 confidence) argued DL improvements are marginal and context-specific, challenging benchmark conclusions; practitioner assessment (Shako, Federal Reserve/Amazon/Stripe experience, 0.75 confidence) documented TSFMs achieving only ~35-40% skill improvement over naive baselines and losing to classical methods—essential negative signal for tier sustainability. Independent third-party benchmark (Parseable, 0.75 confidence) on real observability telemetry validated zero-shot TSFM generalization and probabilistic calibration for production alerting. Methodological research (VLDB 2024 paper, 0.78 confidence) on standardized benchmarking signaled field maturity in evaluation infrastructure. Architectural innovation (Amazon WaveToken, 0.76 confidence) advanced tokenization efficiency across 42 datasets. State: foundation model vendors maintain ecosystem momentum and governance integrations (Azure, Snowflake, MindsDB, Bedrock), demonstrated zero-shot capabilities on real data, yet AWS platform consolidation signal combined with critical peer-reviewed evidence of limited superiority over baselines reinforce that leading-edge classification remains appropriate—deployments are real and broadening, but universal TSFM necessity remains unproven and method selection complexity persists as primary barrier.
2026-Apr: Foundation model competition intensified with Google releasing TimesFM 2.5 (200M parameters, 16k context length, BigQuery GA integration), Amazon Chronos-2 accumulating 600M+ HuggingFace downloads with multivariate support, and Salesforce releasing Moirai-MoE achieving 28x parameter efficiency over larger rivals. ICLR research confirmed TSFMs maintain reliable calibration (PCE <0.05) enabling deployment in high-stakes domains like healthcare. ByteDance/Tsinghua released Timer-S1, billion-parameter sparse MoE with Serial-Token Prediction, advancing cutting-edge architecture research. However, critical NeurIPS 2024 peer-reviewed evidence revealed LLM-based forecasters do not outperform basic attention mechanisms across 13 datasets—direct refutation of universal TSFM necessity. Financial deployments show real-world adoption (TimesFM in algorithmic trading pipelines), while Amazon Science validated RSight deep neural network on 15M e-commerce products with region-aware learning. Amazon's own demand planning research confirmed production systems prioritize forecast stability over marginal accuracy gains; a University of Tennessee supply chain white paper went further, challenging whether forecasting should be the default planning approach at all—attributing limitations to demand variation and organisational factors rather than methodology gaps. Autonomous demand sensing market reached USD 1.63B (2026, 9.46% CAGR). Evidence continues to converge: transformer-based models underperform simple linear models on financial time series (QuitoBench billion-scale benchmark), grid operators apply ensemble methods (14% balancing cost reduction), and academic critiques challenge forecasting assumptions themselves—indicating that method complexity and business misalignment, not model architecture, define practice boundaries.
2026-May: Empirical deployment evidence and negative signal research reinforced pragmatic ecosystem equilibrium. Uber disclosed production Bayesian neural network forecasting for demand and anomaly detection with principled uncertainty decomposition (epistemic, aleatoric, distributional-shift) at scale, advancing neural uncertainty quantification in operational contexts. Energy sector benchmarking (54 datasets, 9 categories) validated TSFM superiority for load forecasting with production transmission system operator deployment of Chronos-2 and TabPFN-TS with zero-shot competitive performance. ICLR 2026 TimeRecipe framework demonstrated systematic architecture component evaluation across 10,000+ experiments with 92% configurations beating SOTA (5.4% error reduction), advancing methodological rigor. However, critical peer-reviewed research surfaced fundamental barriers: empirical study directly documented rule-based model selection failures with context-dependent performance instability; financial time series benchmarking (200,000+ model configurations) identified wavelet-based methodology improvements with quantified results; LLM benchmark across 8 models (GPT-4o, Claude, DeepSeek, Llama) on 33 time-series reasoning tasks revealed critical operational limitations and need for specialized methods; market regime change analysis documented structural failure modes of forecasting approaches in production environments. The state sharpened: deployment evidence confirmed across finance, energy, demand planning with concrete uncertainty quantification improvements, yet critical research simultaneously documented fundamental method-selection instability, market regime blindness, and LLM unsuitability—reinforcing that technology accessibility and ecosystem maturity coexist with unresolved tensions on model selection, business value alignment, and the boundary conditions defining when forecasting itself should be the planning approach.
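Several entries above cite calibration of probabilistic forecasts (for example, PCE < 0.05). The source does not define the metric, so the sketch below assumes one common reading: the mean absolute gap between each nominal quantile level and the share of actuals that fall at or below the forecast quantile. The function name, the quantile levels, and the synthetic data are illustrative assumptions, not the cited study's procedure.

```python
from statistics import NormalDist

import numpy as np

def probabilistic_calibration_error(quantile_forecasts, actuals, levels):
    """Mean absolute gap between nominal levels and empirical coverage.

    quantile_forecasts: array of shape (n_obs, n_levels)
    actuals:            array of shape (n_obs,)
    levels:             nominal quantile levels, shape (n_levels,)
    """
    q = np.asarray(quantile_forecasts, dtype=float)
    y = np.asarray(actuals, dtype=float)[:, None]
    empirical = np.mean(y <= q, axis=0)  # observed coverage per level
    gap = np.abs(empirical - np.asarray(levels, dtype=float))
    return float(np.mean(gap)), empirical

# Hypothetical setup: actuals drawn from a standard normal distribution.
levels = np.array([0.1, 0.25, 0.5, 0.75, 0.9])
z = np.array([NormalDist().inv_cdf(p) for p in levels])
rng = np.random.default_rng(0)
actuals = rng.normal(0.0, 1.0, 5000)

# A forecaster that knows the true spread, and one that is too narrow.
calibrated = np.tile(z, (actuals.size, 1))
overconfident = np.tile(0.5 * z, (actuals.size, 1))

pce_good, _ = probabilistic_calibration_error(calibrated, actuals, levels)
pce_bad, _ = probabilistic_calibration_error(overconfident, actuals, levels)
print(f"calibrated PCE={pce_good:.3f}, overconfident PCE={pce_bad:.3f}")
```

A low value under this definition says the stated intervals can be taken at face value. It says nothing about how wide they are, so a well-calibrated model can still be too uncertain to act on.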