Perly Consulting │ Beck Eco

The State of Play

A living index of AI adoption across industries — where established practice meets the bleeding edge
UPDATED DAILY

The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.

The Daily Dispatch

A daily newsletter distilling the past two weeks of movement in one or two domains, delivered to your inbox while the index updates in the background.

AI Maturity by Domain

Each dot marks the weighted maturity of practices within a domain — hover for a brief summary, click for more detail

DOMAIN
BLEEDING EDGE → ESTABLISHED

Causal inference & uplift modelling

LEADING EDGE

TRAJECTORY

Stalled

AI techniques that go beyond correlation to estimate causal effects and identify which interventions drive outcomes. Includes treatment effect estimation and counterfactual analysis; distinct from predictive modelling which forecasts outcomes without inferring causation.

OVERVIEW

Causal inference and uplift modelling occupy a frustrating position: the tooling is production-ready, but production adoption remains narrow. Unlike predictive modelling, which forecasts outcomes, causal inference estimates what would happen under a specific intervention — the incremental effect of a marketing campaign, a product change, or a credit offer. Uplift modelling applies this at the individual level, identifying which customers will actually respond to treatment rather than converting regardless.
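The distinction between predicted conversion and incremental response can be made concrete with a minimal two-model (T-learner) sketch on synthetic data. Everything below is illustrative: the segment feature, conversion rates, and sample size are invented for the example, not drawn from any deployment in the index.

```python
import random

random.seed(0)

def simulate(n=20_000):
    """Synthetic randomised campaign. Segment 1 are 'persuadables' who
    convert only when treated; segment 0 are 'sure things' who convert
    at the same rate either way."""
    rows = []
    for _ in range(n):
        segment = random.randint(0, 1)   # observable customer feature
        treated = random.randint(0, 1)   # randomised treatment assignment
        if segment == 1:
            converted = 1 if (treated and random.random() < 0.30) else 0
        else:
            converted = 1 if random.random() < 0.60 else 0
        rows.append((segment, treated, converted))
    return rows

def conversion_rate(rows, segment, treated):
    group = [c for s, t, c in rows if s == segment and t == treated]
    return sum(group) / len(group)

rows = simulate()

# T-learner with a one-feature "model": fit a conversion-rate model
# separately on treated and control data, then take the difference.
uplift = {seg: conversion_rate(rows, seg, 1) - conversion_rate(rows, seg, 0)
          for seg in (0, 1)}
print(uplift)  # roughly 0.30 for segment 1, roughly 0.0 for segment 0
```

A predictive model would rank segment 0 highest (60% conversion) even though treating it changes nothing, while segment 1 converts only when treated. That inversion between "most likely to convert" and "most likely to be persuaded" is the whole point of uplift modelling.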

Major cloud vendors now ship causal inference as GA components, open-source libraries like DoWhy have millions of downloads, and the methodology continues to advance. Yet deployment concentrates almost entirely in e-commerce and marketing, where randomised experiment infrastructure already exists. March 2026 benchmarks confirm that 62% of modern treatment-effect models underperform trivial baselines on real-world data; robustness under structural biases remains a critical gap. Healthcare, policy, and other observational domains generate growing research interest but zero clinical or operational integration. The practice is leading-edge in the precise sense: forward-leaning teams extract real value, but most organisations have not started, and the barriers that prevent broader adoption — data volume requirements, assumption validation complexity, and tool consistency gaps — have proven resistant to the vendor investment thrown at them.

CURRENT LANDSCAPE

Platform and tool investment accelerated into May 2026: Microsoft shipped Azure ML and Fabric causal inference components to GA (2025), Alembic raised a $171M Series B for its real-time Causal AI platform serving airline, CPG, and finance customers, and CausaLens launched an enterprise GA platform with named deployments across asset management, investment banking, transportation, and energy. These platforms have reached production maturity with credible enterprise adoption signals. Netflix's decade of production causal infrastructure, spanning localization, recommendations, pricing, and retention optimization, demonstrates maturity at scale, though it required PhD-level teams and multi-year investment.

Production deployment is gaining momentum in marketing. Remerge's 20+ documented uplift test case studies across mobile marketing (2023-2026) show sustained, scaled RCT-based incremental measurement delivering consistent 30-60% CPA reductions across 100+ campaigns. Cassandra.app now serves 100+ marketing teams with geo-based uplift testing; its Gina Tricot case demonstrates consistent ROI improvement across markets. Adyen's May 2026 Uplift product (GA) reports a 10% conversion lift from causal inference applied to trillions of payment transactions. Meta's incremental attribution in the DTC segment showed 18% incremental sales growth in geo-test cases, and multi-channel attribution research shows uplift modeling can surface a 30% budget discrepancy in traditional attribution. These deployments concentrate in e-commerce and marketing, where randomised experiment infrastructure already exists; healthcare research interest intensifies, but clinical workflow integration remains absent.

Real-world reliability barriers persist. The March 2026 Amazon Science benchmark confirms that 62% of CATE models perform worse than trivial predictors on heterogeneous data, and peer-reviewed evidence from major pharmaceutical companies shows single-robust ML estimators can underperform parametric regression unless doubly robust methodology (TMLE or AIPW) and sample splitting are used. Production causal inference systems also face hidden operational risks: model upgrades can shift causal risk estimates by 0.12-0.19 points and widen confidence intervals by 23% on protected cohorts, creating deployment instability.

Methodological advancement toward practitioner accessibility is accelerating: TU Delft research formalizes methods to detect assumption violations and ensure robustness; hierarchical causal models validated on ~3M active users demonstrate recovery of incremental effects under treatment overlap; and agentic AI frameworks now automate variable selection and causal graph construction, cutting expert timelines from weeks to hours. Yet these advances have not expanded adoption beyond core marketing and e-commerce contexts despite years of vendor investment.
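The doubly robust machinery referenced above (AIPW, and TMLE in the same family) reduces to a short formula. The sketch below is a minimal illustration on synthetic data with one binary confounder; the data-generating numbers are invented, and both nuisance models are trivially well specified here, which is exactly what real observational studies cannot assume.

```python
import random

random.seed(1)

TRUE_ATE = 1.0  # treatment effect the estimator should recover

def simulate(n=40_000):
    """Confounded observational data: x drives both treatment and outcome."""
    data = []
    for _ in range(n):
        x = random.randint(0, 1)                               # confounder
        t = 1 if random.random() < (0.8 if x else 0.2) else 0  # selection on x
        y = 2.0 * x + TRUE_ATE * t + random.gauss(0.0, 1.0)
        data.append((x, t, y))
    return data

def mean(values):
    return sum(values) / len(values)

data = simulate()

# Nuisance estimates from group means: outcome model mu[(x, t)] and
# propensity e[x]. With a single binary confounder both are well specified.
mu = {(x, t): mean([y for xi, ti, y in data if xi == x and ti == t])
      for x in (0, 1) for t in (0, 1)}
e = {x: mean([ti for xi, ti, _ in data if xi == x]) for x in (0, 1)}

# Naive difference in means ignores the confounder and is biased upward.
naive = (mean([y for _, t, y in data if t == 1])
         - mean([y for _, t, y in data if t == 0]))

# AIPW score: outcome-model contrast plus inverse-propensity-weighted
# residual corrections, averaged over the sample.
aipw = mean([mu[(x, 1)] - mu[(x, 0)]
             + t * (y - mu[(x, 1)]) / e[x]
             - (1 - t) * (y - mu[(x, 0)]) / (1 - e[x])
             for x, t, y in data])

print(round(naive, 2), round(aipw, 2))  # naive lands near 2.2; aipw near 1.0
```

The correction terms are what make the estimator "doubly robust": it stays consistent if either the outcome model or the propensity model is right, which is why the pharmaceutical evidence cited above favours AIPW or TMLE over single-robust ML estimators.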

The research frontier is moving toward reliability assessment and practitioner accessibility. New benchmarks at ICLR 2026 exposed critical LLM weaknesses in causal reasoning: evaluated across Pearl's causal ladder (discovery, intervention, counterfactual), LLMs achieve 93.5% on discovery but degrade sharply to 81.9% on intervention and 73% on counterfactual reasoning, limiting autonomous causal method selection. Methodological work continues to push boundaries: theoretical advances in heterogeneous treatment effects (HTE) clarify the assumptions needed for mechanism testing, multi-treatment effect identification under unmeasured confounding now has √n-consistent estimators, and HTE estimation extends to survival outcomes with clinical application. NSF-funded educational tooling (thinkCausal, built on stan4bart) shows superior accuracy and speed over alternative methods in a randomized validation study, advancing practitioner accessibility. Yet these advances remain concentrated in academic and research settings rather than driving operational adoption.

The core adoption barriers are well-documented and persistent: minimum data volumes of 10,000+ per treatment arm, 10-20% ATE estimate divergence between major libraries using identical configurations, and model generalization failure across campaigns. Healthcare has generated over 4,300 clinical publications referencing causal methods, yet systematic reviews find zero integration into clinical workflows. Analyst surveys project enterprise interest — 62% planning a shift toward causal decision intelligence within 18 months — but the gap between stated intent and operational deployment defines this practice's stalled trajectory.
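The 10,000-plus figure is roughly what a textbook two-proportion power calculation yields for the small effects uplift programmes chase. A minimal sketch, assuming a 5% control conversion rate, a two-sided 5% significance level, and 80% power (all illustrative choices, not parameters from the surveys cited above):

```python
from math import sqrt

def n_per_arm(p_control, p_treated, z_alpha=1.96, z_beta=0.8416):
    """Subjects needed per arm to detect p_treated - p_control with a
    two-sided 5% test at 80% power (standard two-proportion formula)."""
    p_bar = (p_control + p_treated) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p_control * (1 - p_control)
                                 + p_treated * (1 - p_treated))) ** 2
    return numerator / (p_treated - p_control) ** 2

# A one-point uplift on a 5% conversion base already needs ~8,000 per arm;
# halving the detectable uplift roughly quadruples the requirement.
print(round(n_per_arm(0.05, 0.06)))   # roughly 8,000 per arm
print(round(n_per_arm(0.05, 0.055)))  # past 30,000 per arm
```

Because required sample size scales with the inverse square of the effect size, the modest incremental effects typical of uplift campaigns push arm sizes into the tens of thousands, which is why the data-volume barrier has proven so persistent.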

TIER HISTORY

Research: Jan 2019 → Jan 2019
Bleeding Edge: Jan 2019 → Jan 2020
Leading Edge: Jan 2020 → present

EVIDENCE (130)

— Agentic AI framework automates causal variable selection and graph construction; reduces expert timeline from weeks to hours—expanding practitioner accessibility.

— Platform serves 100+ marketing teams with geo-based uplift testing; Gina Tricot case demonstrates consistent ROI improvement across markets.

— ICLR 2025 Amazon/UCLA benchmark: 62% of contemporary CATE models underperform trivial baseline on real-world heterogeneous data—critical reliability limitation.

— Adyen Uplift GA product reports 10% conversion lift using causal inference on trillions of payment transactions; independent Nord Security customer validation.

— Methodological advance validated on ~3M active users; demonstrates recovery of incremental effects under treatment overlap—addresses real-world multi-channel complexity.

— Meta's incremental attribution adoption in DTC segment; geo-test case study showed 18% incremental sales growth (NY +28% vs CA +10% baseline).

— TU Delft dissertation formalizes methods to detect assumption violations and ensure robustness in causal inference—advancing practitioner safety and reliability.

— Theoretical framework clarifying assumptions for using HTEs to test causal mechanisms; reveals theory-practice gap in HTE interpretation foundational to uplift modeling.

HISTORY

  • 2019: Industry-scale causal inference deployments at Uber and other tech companies; open-source libraries (CausalML, DoWhy) reach production maturity; academic research documents both advances in uplift modelling and fundamental limitations in multi-cause inference with hidden confounders.
  • 2020: Tooling reaches stable releases (EconML, DoWhy v0.5+) with education resources on major cloud platforms; applied research validates revenue uplift optimization on e-commerce data; research community focuses on detecting assumption violations and uncertainty quantification, marking transition from pure research to assumption-aware deployment.
  • 2021: Vendor tool expansion (IBM Causal Inference 360, Booking.com UpliftML) signals production deployment in e-commerce and cross-domain applications; interdisciplinary research expansion into NLP and healthcare. Simultaneously, high-profile study shows observational causal inference fails on online platforms (Twitter), and peer-reviewed methodological critique highlights integration barriers—ecosystem becomes more honest about limitations.
  • 2022-H1: Research momentum accelerates with major surveys consolidating methodology and identifying five research domains; applications expand into recommender systems and precision medicine. DoWhy transitions to PyWhy community governance. However, systematic review reveals causal methods adoption remains sparse in applied fields (infectious disease); adoption gaps widen as methodological complexity and assumption-validation barriers persist outside e-commerce/marketing.
  • 2022-H2: Tooling maturity advances (DoWhy v0.9 adds functional API, GCM support, and faster refutations); real-world deployments emerge in fintech marketing with measured cost-per-acquisition gains. However, biomedical benchmarking reveals critical scalability limitations of current methods on real-world data, and healthcare review documents persistent adoption barriers despite theoretical availability. Methodological work focuses on evaluation robustness (RCT-based variance reduction) and assumption validation—ecosystem remains honest about limitations constraining broader adoption.
  • 2023-H1: Vendor ecosystem expands with AWS and Microsoft jointly governing DoWhy through PyWhy (Jan 2023), signaling major cloud provider commitment. Commercial platforms emerge (Causalens, Vianai). However, landmark benchmark study (May 2023) reveals 62% of modern CATE models perform worse than trivial predictors on real-world data, documenting critical validity gaps. Applied research refines decision-tree uplift methods for churn prevention, and emerging research explores LLM-based causal inference—methodology expands even as empirical limitations become clearer.
  • 2023-H2: Google releases cost-aware uplift modeling tooling for marketing optimization. Industry adoption research accelerates in pharmaceutical and healthcare domains (Roche, ICU studies), alongside critical methodological analyses revealing conditions under which uplift approaches underperform classical methods. Open-source ecosystem consolidates with distinct tooling specializations (CausalML vs. EconML) and continued community education (PyCon tutorials). Field demonstrates balanced maturity: expanding applications with honest acknowledgement of real-world performance gaps and adoption barriers outside core e-commerce/marketing use cases.
  • 2024-Q1: Core ecosystem advances with DoWhy-GCM published in JMLR (Jan 2024) extending to causal discovery and root cause analysis; library reaches 3+ million downloads. Industrial deployments continue (Tencent FiT revenue uplift, Hong Kong research on mixed treatments). Methodological focus on reliability: research addresses variance reduction in uplift evaluation (EJOR Feb 2024) and conditions for method success. Expanding application surveys cover recommender systems (Feb 2024) and LLM-causal inference intersections (Mar 2024). Critical analyses of validity gaps emerge: educational and observational data studies document where causal inference assumptions fail, reinforcing that adoption remains concentrated in RCT-capable e-commerce/marketing domains.
  • 2024-Q2: Healthcare adoption signals accelerate: JAMA endorses causal inference frameworks for observational study design (May), and clinical research advances personalized decision support via causal graph learning. Biomedical benchmarking (CausalBench) provides largest open benchmark for causal discovery on real perturbation data. Community dissemination intensifies at SciPy 2024 with practical uplift modeling tutorials. Industrial deployment guides document Uber, Microsoft, and TripAdvisor applications. Core developer perspectives (DoWhy podcast) emphasize LLM augmentation of causal reasoning. Methodologically, focus remains on reliability and real-world performance constraints; adoption signals in healthcare remain research-led rather than clinical-workflow integrated.
  • 2024-Q3: Methodological expansion continues across multiple domains: genetic/genomic causal inference matures with standardized benchmarking (Mendelian randomization validation across 1000+ traits), while integration with deep learning advances via comprehensive surveys. Best Buy industry research validates multi-treatment uplift modeling on real marketing data. Critical analysis persists: HKUST review of 300+ operations management papers documents persistent applicability limits and identification strategy trade-offs in observational research. Academic conference activity (ECML) addresses extensions like limited-supervision uplift modeling. Practitioner insights from Meta emphasize method reliability and performance variability. Field balance remains: expanding applications across genomics, marketing, and deep learning architecture alongside honest assessment of when and where methods succeed or fail in observational practice.
  • 2024-Q4: Ecosystem deployment and critical assessment deepen in balance. Naver Pay releases production double machine learning uplift modeling for multi-treatment marketing optimization; PLOS ONE publishes causal tree/forest application to national health survey data (Australia) for exercise-BMI intervention planning; Journal of Marketing Analytics documents B2B cross-sell uplift modeling effectiveness. Methodological extension emerges for continuous-treatment uplift modeling (CADR with integer programming) tested across healthcare, lending, and HR. However, critical large-scale benchmark (Oct 2024) evaluates 16 contemporary CATE models across 12 datasets and finds 62% perform worse than trivial zero-effect predictors—reinforcing that real-world heterogeneity remains difficult to capture reliably. Practical tool interoperability challenges surface: GitHub issue documents 10-20% ATE estimate divergence between EconML and DoWhy with identical setups, signaling consistency concerns. By year-end 2024, field demonstrates characteristic maturity: expanding deployment applications and methodological sophistication alongside persistent honest documentation of when and where methods fail on real-world data.
  • 2025-Q1: Platform integration signals deepening vendor commitment with Azure ML and Microsoft Fabric releasing causal inference GA components (Feb-Mar 2025) combining EconML and DoWhy into production data science workflows. UMGNet framework advances sparse-data uplift modeling using graph neural networks and active learning to address e-commerce deployment barriers. Real-world application studies expand: DoWhy applied to education analytics with quantified causal effect estimates; B2B and marketing case studies demonstrate incremental value. However, critical perspectives become more visible: Frontiers commentary documents implementation barriers including complexity, data requirements, scalability, and cost obstacles; practitioner analyses highlight A/B testing limitations and argue for uplift modeling while noting organizational adoption challenges (70% false positives in traditional testing, but uplift modeling requires significant capability investment). ICLR 2025 benchmark replicates prior findings showing 62% of contemporary CATE models underperform trivial baselines. Adoption remains concentrated in e-commerce/marketing; no expansion into healthcare, education, or other observational domains despite tool availability. Field maturity manifests through honest literature acknowledging both expanding tool availability and persistent practical adoption barriers.
  • 2025-Q2: Methodological expansion focuses on staggered-adoption scenarios (DiD-BCF) with policy application, MLOps operationalization patterns, and sparse-data approaches. Healthcare research interest grows substantially: bibliometric analysis documents 4,316 clinical publications with emerging big data focus, though clinical workflow integration lags. Tool interoperability concerns surface: GitHub issues document 10-20% ATE divergence between EconML and DoWhy. Accessibility/reframing efforts emerge (causal inference as prediction under distribution shift) targeting broader practitioner adoption. Adoption expansion remains stalled outside e-commerce/marketing despite 18+ months of vendor platform integration. By mid-2025, field demonstrates characteristic mature-technology pattern: sophisticated methodology, expanded research interest in healthcare and observational domains, persistent production barriers (complexity, assumption validation, tool consistency) that have resisted mitigation, and sustained concentration of real-world deployment in RCT-capable marketing contexts.
  • 2025-Q3: Methodological innovation accelerates for real-world constraints: Booking.com advances uplift under network interference via profit optimization; position papers emerge (ICML) arguing rigorous synthetic experiments are essential for validating reliability before broader adoption. Tooling innovation focuses on lowering barriers: LLM-empowered co-pilots (CATE-B) automate causal discovery and method selection. Leading statisticians (Imbens et al.) publish major research highlighting open challenges across statistics, biomedical, and social science domains. However, field's core adoption challenge remains unresolved: systematic reviews document zero causal inference adoption in healthcare AI (immunotherapy, 126 papers), and ICLR 2025 benchmark replicates prior finding that 62% of contemporary CATE models underperform trivial zero-effect predictors on real-world heterogeneity. By Q3 2025, field demonstrates mature stasis: sophisticated methodological and tooling development, explicit recognition by leading voices that fundamental adoption barriers persist, and no expansion into healthcare, observational, or other non-marketing domains despite continued ecosystem maturation and capability availability.
  • 2025-Q4: Ecosystem expansion into new sectors: Esri integrates causal inference analysis into ArcGIS Pro for geospatial effect estimation (Nov 2025); healthcare research interest intensifies as major biostatistics symposium emphasizes causal inference's role in clinical research. LLM-causal synergies emerge as research direction in survey literature. However, practitioner-driven critical assessment dominates: opinion literature highlights concrete barriers—data volume requirements (10k+ treatment/control), model generalization failure across campaigns, and cost-benefit analysis showing uplift requires significant organizational capability investment. Tool consistency issues persist (10-20% ATE divergence between libraries). Adoption expansion remains stalled outside e-commerce/marketing; healthcare remains research-led with zero clinical workflow integration despite intensified research recognition.
  • 2026-Jan: Academic and analyst ecosystem signals accelerate: Harvard CAUSALab formalizes causal inference training at leading public health institution; American Economic Association conference elevates causal methods for macroeconomic applications; Harvard Data Science Initiative demos GenAI-powered causal inference frameworks. Industry analyst theCUBE Research predicts 2026 emergence of Causal AI Decision Intelligence with 62% of enterprises planning adoption shift within 18 months, positioning causal methods as critical for trustworthy agentic AI decision-making. Research interest in healthcare automation expands (Miguel Hernán lecture on AI-driven causal research). Practitioner sophistication in evaluation metrics deepens (Meta methodological work on specialized metrics). Window is primarily training and forward-looking analyst prediction rather than new production deployments; adoption expansion remains concentrated in prior domains with expanded research signaling in healthcare and macro domains.
  • 2026-Feb: Evaluation framework maturation accelerates with community emphasis on reliability before adoption: WSDM 2026 CausalBench workshop (Feb) organizes benchmarking collaboration; arXiv introduces CausalReasoningBenchmark (173 queries across 138 datasets) revealing LLM identification gaps (84% strategy, 30% full specification), and ICLR debuts CausalPitfalls benchmark exposing LLM failures on statistical pitfalls. Methodological advances address complex real-world scenarios: combinatorial treatment uplift learning, time-series causal discovery (econometric vs. ML comparison on UK COVID data), and longitudinal ordinal outcome inference for healthcare. LLM-causal integration shows research interest but evaluation reveals critical reliability gaps. Practitioner barriers persist unchanged: tool consistency issues, data volume requirements (10k+), campaign generalization failure. Adoption remains stalled outside e-commerce/marketing despite ecosystem maturity; healthcare remains research-only.
  • 2026-Mar: Amazon Science benchmark confirms 62% of CATE models underperform trivial predictors on real-world heterogeneous data; Netflix published a detailed account of decade-long production causal infrastructure spanning localization, recommendations, pricing, and retention — demonstrating maturity at scale while documenting PhD-level team requirements and multi-year investment barriers. Alembic launched real-time Causal AI platform v3.0 (Series B, airline/CPG/finance customers); precision medicine applications advanced with a digital health HTE study (1,113 employees, 5.2% uncontrolled hypertension reduction by subgroup) and open-source sensitivity analysis tools for observational HTE — but adoption concentration in e-commerce and marketing remains unchanged.
  • 2026-Apr: Research pushed toward practitioner accessibility: InferenceEvolve demonstrated LLM-guided evolutionary frameworks automating causal method selection, Causal-Audit introduced time-series assumption validation (78% abstention on severe violations), and peer-reviewed research confirmed single-robust ML estimators underperform doubly-robust methods (TMLE, AIPW) — reinforcing known reliability gaps. METER benchmark (4,145 items) revealed sharp LLM performance degradation across Pearl's ladder (93.5% causal discovery vs 73% counterfactual), limiting LLM-assisted causal automation. CausaLens launched enterprise GA causal AI platform with named customers across asset management, investment banking, transportation, and energy. Production deployment barriers documented: model upgrades in causal inference pipelines shifted risk estimates by 0.12-0.19 points and increased confidence interval widths 23%, creating deployment instability. Remerge published 20+ RCT-based uplift case studies (2023-2026) showing 30-60% CPA reductions across 100+ mobile marketing campaigns, providing the strongest documented production evidence base for the practice. Adoption remained concentrated in e-commerce and marketing; no clinical workflow integration despite sustained healthcare research interest.
  • 2026-May: Production deployment in marketing and fintech continued to accumulate: Adyen Uplift GA reports 10% conversion lift using causal inference on trillions of payment transactions (validated by Nord Security); Cassandra.app serves 100+ marketing teams with geo-based uplift testing with documented ROI improvement across markets; Meta incremental attribution geo-tests demonstrated 18% incremental sales growth. Critical reliability signal reinforced: ICLR 2025 Amazon/UCLA benchmark confirmed 62% of contemporary CATE models underperform trivial baselines on real-world heterogeneous data. Practitioner accessibility improving via agentic automation (variable selection and graph construction reduced from weeks to hours), and TU Delft research formalised assumption-violation detection methods. Adoption concentration in e-commerce and marketing remained unchanged despite growing tooling accessibility and deepening evidence base.