The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.
AI that helps design experiments, determines sample sizes, analyses results, and identifies statistically significant outcomes. Includes automated experiment design and Bayesian analysis; distinct from marketing attribution which analyses campaign effectiveness rather than product experiments.
A/B testing is how product teams validate hypotheses, and it is now firmly established practice. With 78% of firms conducting experiments and a mature vendor ecosystem processing trillions of events daily, the question is no longer whether to A/B test but how to do it well. That distinction matters, because the practice's defining paradox has survived every wave of platform improvement: tooling is sophisticated, but execution remains fragile. Analysis of over 10,000 real-world tests found only 12% delivered meaningful business impact, and false positive rates persist above 26% even at organisations like Microsoft and Netflix. Platform consolidation (Datadog-Eppo, OpenAI-Statsig, Adobe-DynamicYield) reflects industry recognition that experimentation infrastructure is strategically essential as AI accelerates deployment velocity. Warehouse-native architectures, Bayesian inference, and sequential testing with variance-reduction techniques have made experimentation faster and more accessible than ever. However, practitioner-level errors remain endemic: approximately 57% of experimenters p-hack, inflating false discovery rates from 33% to 42%; 72% of first experiments contain mistakes; and teams often fail to deploy test winners, accumulating technical debt that silently erodes conversion by 0.5-1% per 100ms latency. The gap between platform capability and practitioner skill -- not technology -- is what separates organisations that extract value from those that do not.
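One widely used variance-reduction technique is CUPED, which subtracts the part of a metric that a pre-experiment covariate already explains. The sketch below is a minimal illustration on synthetic data, not any vendor's implementation; the function name `cuped_adjust` and the generated data are assumptions made for the example.

```python
import random
from statistics import fmean, variance


def cuped_adjust(metric, pre_metric):
    """Return CUPED-adjusted values: y - theta * (x - mean(x)).

    metric:     in-experiment values, one per user
    pre_metric: the same users' pre-experiment values (the covariate)
    """
    x_mean = fmean(pre_metric)
    y_mean = fmean(metric)
    n = len(metric)
    # Sample covariance between the covariate and the metric.
    cov_xy = sum(
        (x - x_mean) * (y - y_mean) for x, y in zip(pre_metric, metric)
    ) / (n - 1)
    theta = cov_xy / variance(pre_metric)
    return [y - theta * (x - x_mean) for x, y in zip(pre_metric, metric)]


# Synthetic, illustrative data: the in-experiment metric correlates
# strongly with each user's pre-experiment behaviour.
rng = random.Random(0)
pre = [rng.gauss(100, 20) for _ in range(5000)]
post = [0.8 * x + rng.gauss(0, 10) for x in pre]

adjusted = cuped_adjust(post, pre)
print(f"variance before: {variance(post):.1f}, after: {variance(adjusted):.1f}")
```

The adjustment leaves the mean unchanged and shrinks the variance, so the same effect can be detected with fewer users. In a real experiment theta would typically be estimated on data pooled across variants so that the adjustment itself cannot bias the comparison.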
Vendor consolidation has accelerated: Datadog's $220M Eppo acquisition (May 2025) was followed by OpenAI's $1.1B Statsig acquisition (September 2025). These moves reflect strategic recognition that experimentation infrastructure is essential as AI accelerates feature velocity. Datadog's May 2026 general availability of its Experiments platform integrates A/B testing, business metrics, product analytics, and APM observability, signaling the industry direction toward embedded, warehouse-native, integrated platforms with native guardrail support. Statsig's named deployments tell the scale story: OpenAI runs hundreds of experiments across hundreds of millions of users, Notion scaled from single-digit to 300+ quarterly experiments, and HelloFresh achieved a 60x speedup in its Bayesian testing pipeline through computational optimisation. Uber's production Experimentation Platform (XP) documents 1,000+ simultaneous experiments with sequential testing (SPRT), causal inference, and multi-armed bandits across driver, rider, Eats, and Freight, the current gold standard in infrastructure. Bayesian methods have gone mainstream: 58% of large organisations now prefer them over frequentist approaches, and enterprise adoption grew 45% year-over-year. Real-world deployment evidence confirms operational maturity: 6,899 ecommerce A/B tests document deployed patterns (discount framing, social proof skepticism, copy effects); DoorDash runs 12,000+ experiments annually across 42M monthly active users with multi-sided marketplace testing; Momentum Nexus scaled from 6 to 53 tests per quarter through systematic hypothesis mining and infrastructure discipline. Warehouse-native platforms are expanding enterprise adoption: Wikimedia Foundation deployed GrowthBook for automated experiment decision frameworks with configurable stopping criteria (Clear Signals vs Do No Harm thresholds), exemplifying how modern teams integrate statistical engines with governance rules.
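To make the sequential-testing idea concrete, here is a minimal sketch of Wald's sequential probability ratio test (SPRT) for a single conversion rate. It is a textbook simplification, not Uber's or any vendor's implementation: production platforms compare two arms and often use mixture variants. The function name, the hypothesised rates `p0` and `p1`, and the default error rates are assumptions for illustration.

```python
import math


def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.20):
    """Wald's SPRT for H0: rate == p0 versus H1: rate == p1.

    observations: iterable of 0/1 outcomes, processed in arrival order
    Returns (decision, number_of_observations_used).
    """
    upper = math.log((1 - beta) / alpha)   # cross above: accept H1
    lower = math.log(beta / (1 - alpha))   # cross below: accept H0
    llr = 0.0                              # running log-likelihood ratio
    n = 0
    for n, converted in enumerate(observations, start=1):
        if converted:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept_h1", n
        if llr <= lower:
            return "accept_h0", n
    # Data ran out before either boundary was crossed.
    return "continue", n


print(sprt_bernoulli([1, 1, 1, 1], p0=0.1, p1=0.3))
print(sprt_bernoulli([0] * 20, p0=0.1, p1=0.3))
```

Because the boundaries are fixed in advance, the test can be checked after every observation without the false-positive inflation that ad hoc peeking at a fixed-horizon test causes.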
June 2026 scanning shows vendors accelerating toward AI-driven automation: Optimizely's Opal framework launched specialized agents for experiment design (QBR generation, value estimation, backlog prioritization), while GetYourGuide's sequential testing deployment achieved a 40% reduction in experiment cycle times. Shopify's portfolio of 36 live winners across 1,000+ Plus-tier stores documents $2.3M+ in monthly aggregate revenue lift, confirming that real deployments continue to compound incremental gains across collection pages, cart, and product detail surfaces. The market continues to expand, from USD $1.67B (2026) to a projected $4.82B (2036) at 11.2% CAGR, with a key technology shift from client-side JavaScript to server-side architectures that enable feature flag integration and privacy compliance.
Yet platform sophistication has not closed the execution gap. Ronny Kohavi (ex-VP Airbnb, experimentation researcher) documents an industry median success rate of 10%, far below Microsoft's 33%, with a 22% false positive risk. Analysis of 2,101 Optimizely experiments reveals that ~57% of teams p-hack, inflating false discovery from 33% to 42%. A January 2026 CRO analysis found that 72% of first experiments contained mistakes, with documented losses from false positives reaching 42% of annual revenue. Documented failures at Etsy, Duolingo, Heap, and PostHog show that early stopping, metric misalignment, novelty bias, and deployment errors remain endemic even at sophisticated organisations. Google and Meta have been found to market observational methods as randomised experiments, undermining causal claims. Even Uber's platform evolution illustrates the structural challenge: Morpheus, the internal platform built 7+ years earlier, required complete re-architecture in 2020 because "large percentage of experiments had fatal problems" and "core abstractions supported only a very narrow set of experiment designs correctly." This pattern of platform correctness failures at scale, despite vendor maturity, shows that execution fragility also arises inside the platform itself and is not solved simply by adopting more mature tooling. April 2026 industry benchmarks from Foundry CRO document the adoption-execution gap sharply: 77% of companies claim A/B testing but less than 0.2% actively experiment; of those who do, only 36.3% achieve statistically significant wins, with a median uplift of 1.88%. A new limitation emerged in Q1 2026: on dynamic platforms like TikTok and in AI-driven search environments, rapid shifts in traffic composition (AI search traffic converts at roughly 5x the rate of traditional search) break test representativeness, making historical results non-predictive of future performance.
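The link between a low success rate and a high false positive risk can be worked through with Bayes' rule. The sketch below assumes a two-sided significance level of 0.05 (so 0.025 on the "improvement" side) and 80% power; those defaults are assumptions, not values stated in this entry, though the results appear consistent with the figures quoted here.

```python
def false_positive_risk(success_rate, alpha=0.05, power=0.80):
    """Probability that a statistically significant win is not a real effect.

    success_rate: prior share of tested ideas that truly improve the metric
    alpha:        two-sided significance level (assumed 0.05)
    power:        probability of detecting a true improvement (assumed 0.80)
    """
    # Only half of a two-sided alpha produces a false "win".
    alpha_one_sided = alpha / 2
    false_wins = alpha_one_sided * (1 - success_rate)
    true_wins = power * success_rate
    return false_wins / (false_wins + true_wins)


for rate in (0.08, 0.10, 0.33):
    risk = false_positive_risk(rate)
    print(f"success rate {rate:.0%}: false positive risk {risk:.1%}")
```

Under these assumptions a 10% success rate yields a risk of about 22%, and an 8% success rate yields about 26.4%. The fewer ideas that truly work, the larger the share of "significant" wins that are noise, which is why the risk stays high even where tooling is mature.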
June 2026 analysis reveals an emerging class of failures in AI system testing: embedding model drift between test windows, feature flag leakage into model inputs, shared memory contamination across variants, and sparse signal problems (23 daily users require months of testing) are novel confounds that violate classical experimental assumptions. Mobile environments continue to present structural barriers: achieving statistical validity on a 3.2% baseline CVR requires 50,000+ users committed to a single test, forcing underpowered designs or qualitative evaluation in traffic-constrained contexts. Adoption is broad but shallow: 60% of companies run fewer than five tests monthly. The practice has also accumulated technical debt from incomplete deployments: teams leave winning tests at 100% in A/B testing platforms rather than shipping to production, creating silent conversion losses of $150–300K monthly at scale through accumulated latency and unnecessary DOM mutations. AI-powered entrants promise automated hypothesis generation, but the structural challenge persists: scaling experimentation requires statistical literacy, operational discipline, and infrastructure investment that no platform can substitute.
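The traffic requirement for a low baseline conversion rate can be estimated with the standard two-proportion sample-size formula. This is a minimal sketch: the 10% relative lift, the 0.05 two-sided significance level, and the 80% power are assumptions chosen for illustration, since the entry does not state the minimum detectable effect behind its figure.

```python
import math
from statistics import NormalDist


def users_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed in EACH variant of a two-arm test.

    baseline_rate: control conversion rate, e.g. 0.032 for 3.2%
    relative_lift: smallest relative improvement worth detecting, e.g. 0.10
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * variance_sum / (p2 - p1) ** 2
    return math.ceil(n)


print(users_per_variant(0.032, 0.10))  # 3.2% baseline, 10% relative lift
print(users_per_variant(0.032, 0.05))  # halving the lift roughly quadruples n
```

With these assumed inputs the result is on the order of 50,000 users per variant. Required sample size scales with the inverse square of the detectable lift, so small expected effects on low-traffic surfaces quickly become impractical to test.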
— Yu Zhang et al. systematically address CUPED methodological nuances for complex scenarios (multi-arm experiments, two-stage designs); findings deployed and validated in production at ByteDance's experimentation platform serving 1000+ concurrent tests.
— Wanted Lab (Korea's largest AI recruitment platform, millions of users) deployed formalized A/B testing culture achieving 150% landing-page sign-up lift; democratized data access across product and marketing with Amplitude, building self-sufficient experimentation cycles (1300+ charts annually).
— Persson et al. develop statistical foundations for using LLMs as surrogate endpoints in A/B testing; empirical validation on Upworthy shows LLM-only predictions recover 39% of human treatment effects, nonparametric calibration closes gap; formal theoretical requirements for validity.
— Google Ads GA feature enables structured A/B testing of creative asset groups within Performance Max, addressing long-standing 'creative black box' problem; supports asset group comparison, seasonal creative, and AI-generated asset validation with MCC/API access.
— Critical analysis of A/B testing validity challenges in AI systems: non-determinism breaks stable-treatment assumption, metric definition ambiguity, and outcome variance make classical statistical framework inapplicable; proposes sequencing evals before tests and treating variants as parameterized systems.
— Zen van Riel's structured guide to A/B testing AI agents; specific requirements: 10K+ interactions per variant, evaluation harness with gold dataset (200-500 curated inputs), segregated logging, multi-metric evaluation (success, latency, hallucination, cost); addresses stochasticity and component isolation challenges.
— Optimizely's three new experimentation AI agents (QBR generation, value estimation, backlog prioritization) demonstrate major vendor shift toward AI-driven experiment design and business impact quantification, signaling platform evolution toward orchestration automation.
— Deep technical case study of Google's fleet-wide experimentation infrastructure addressing scale challenges: deterministic assignment consistency, exposure logging for causal inference, overlap management, and safety guardrails across interconnected services.
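Deterministic assignment, mentioned in the last item above, is commonly achieved by hashing a stable user identifier together with the experiment identifier. The sketch below illustrates that general pattern only; it is not Google's implementation, and the function name, key format, and identifiers are hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment_id, variants=("control", "treatment")):
    """Map a user to a variant; the same inputs always give the same answer.

    Including experiment_id in the hashed key decorrelates assignments
    across experiments, so overlapping tests do not share the same split.
    """
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes as an integer and bucket by variant count.
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


# Hypothetical identifiers, for illustration only.
print(assign_variant("user-123", "checkout-button-color"))
print(assign_variant("user-123", "checkout-button-color"))  # same result
```

Because assignment is a pure function of the identifiers, any service can recompute it without a shared lookup table. Exposure should still be logged at the moment the user actually sees the variant, since assignment alone does not establish who was exposed.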
2018: A/B testing platforms achieved market dominance with VWO and Optimizely as primary competitors; real deployments generated documented six-figure ROI, but external validity and statistical interpretation emerged as limiting factors. Bayesian methods explored as alternative to frequentist testing, though practical deployment challenges remained.
2019: NBER research confirmed real-world adoption impact on startup performance; defensive methodologies (A/A testing, SRM checks) became standard practice. Data quality issues in platform implementations and Safari ITP privacy changes emerged as major adoption barriers, forcing infrastructure re-architecture and increasing testing costs.
2020: A/B testing tooling matured (VWO expanded to full experimentation platform; A/B Smartly entered market with single-tenant offering) but adoption remained constrained by practical barriers: traffic thresholds (minimum 20K visitors), methodological confusion (Bayesian claims questioned by experts), implementation failures (production cases showed cost of underpowered tests), and organizational preconditions (process maturity, resources). Browser privacy pressure intensified, forcing client-side-to-server-side migration across enterprise deployments.
2021: A/B testing expanded into healthcare: NYU Langone Health published peer-reviewed case study integrating A/B testing into EHR systems for clinical decision support, extending beyond e-commerce. Optimizely's free-tier Rollouts plan lowered entry barriers for startups and SMBs. Grassroots adoption continued (300+ startup founders documented as active practitioners) but vendor ROI claims faced critical scrutiny. Methodological errors persisted despite increased platform sophistication. Infrastructure complexity remained the primary adoption barrier for enterprises.
2022-H1: A/B testing platforms matured with clear financial validation: Optimizely customers achieved 286% ROI with six-month payback; peer-reviewed research confirmed 30-100% startup performance improvement. Yet deployment fragility increased visibility: Netflix experiments revealed systematic measurement bias on congested networks (5-15% misattribution); performance trade-offs emerged with anti-flicker snippets causing 3.3-second LCP penalties. Large-scale data showed 50%+ test failure rates, with some domains exceeding 90%—indicating that platform maturity did not translate to execution maturity. Tool adoption expanded into healthcare and physical products (Cefaly medical device), validating cross-vertical deployment. Practitioners continued systematic methodological errors despite platform sophistication, and minimum traffic thresholds (20K visitors) plus engineering complexity remained binding constraints on adoption.
2022-H2: A/B testing vendor landscape shifted with Google Optimize's announced sunset, forcing mid-market user migration. Platform competition consolidated around specialized entrants: VWO sustained G2 leadership (6th consecutive) across experimentation categories; Eppo emerged with modular feature-flag-plus-testing architecture. Methodological sophistication expanded with multi-armed bandit and sequential testing offerings. Meta-analysis of 1,001 tests captured real-world H2 2022 deployment patterns and outcome distributions. Critical assessments from growth companies documented persistent limitations: effect-size variability, sequential testing pitfalls, and endemic practitioner errors in hypothesis interpretation, reinforcing that platform maturity remained decoupled from execution maturity.
2023-H1: A/B testing infrastructure evolved at scale: Statsig's infrastructure migration to BigQuery handled 30B+ events daily with real-time metrics capabilities, validating enterprise platform maturity. Practice expanded into generative AI: named deployments by WhatNot, Captions, and Notion demonstrated A/B testing methodology applying to LLM parameter optimization. Methodological debates persisted: academic papers continued to critique Bayesian approaches and their practical adoption, while practitioner perspectives highlighted persistent pitfalls (local optima, metric misalignment). Automated experiment design research (MIT's AutODEx) advanced the methodology frontier, but execution fragility remained the limiting factor for most organizations.
2024-Q1: A/B testing deployment continued at scale: Netflix validated platform maturity with 20-30% viewing lift from image A/B tests; Statsig processed 1+ trillion events daily with named customers (Brex, Ancestry, Notion, Lime) reporting 9-30x experimentation velocity increases. However, critical research emerged identifying fundamental measurement bias: SMU/Michigan study demonstrated algorithmic confounding in ad platform A/B tests where targeting optimization can reverse effect signs, invalidating results. Practitioner consensus consolidated around scope limitations: clear guidance emerged on scenarios where A/B testing should not be used (insufficient randomization units, large redesigns), with alternatives like interrupted time series gaining traction. Vendor pricing and lock-in concerns remained adoption barriers despite platform maturity.
2024-Q2: A/B testing infrastructure matured at large-scale deployments: Adevinta's internal 'Fisher' package (Python-based) achieved over 90% adoption across Marktplaats, reducing hands-on experiment time from days to 3 hours per test, and freed 9 weeks annually at scale. Platform vendor consolidation accelerated with Optimizely Full Stack sunset (July 2024), forcing mid-market migrations and re-evaluation of experimentation architecture. Methodological sophistication expanded with Bayesian and sequential testing becoming standard features across platforms. Ecosystem remained stable with VWO, Optimizely, Eppo, Statsig, and A/B Smartly competing on feature depth, ease of use, and data integration; practitioner focus shifted toward internal infrastructure and process optimization over platform selection.
2024-Q3: A/B testing platforms matured at enterprise scale with warehouse-native architectures: Statsig deployments at Bloomberg, HelloFresh, and Grammarly validated advanced statistical methods (CUPED, Winsorization) for large-scale experimentation. Real-world case studies documented concrete ROI: Quip achieved 4.7% order conversion lift; ATG reached 10% checkout conversion improvement through feature flag A/B testing. Market adoption breadth extended to 71% of companies (per Worldmetrics), yet execution depth remained shallow with 60% running fewer than five tests monthly. Critical analysis emerged identifying persistent methodological gaps: false positive rates (36% of significant results despite 10% true effect rates), sequential testing pitfalls, and novelty bias undermining external validity—suggesting platform maturity masked practitioner skill gaps.
2024-Q4: A/B testing market matured with sustained vendor investment and Bayesian methodology adoption. Global market projected at $850M+ in 2024 (14% CAGR through 2031) with 77% of firms worldwide conducting A/B testing, though execution depth remained uneven: 71% run 2+ tests monthly while 60% remain below five tests monthly. Tool ecosystem expanded from 230 to 271 platforms in one year, signaling market growth and competitive differentiation. Statsig and Eppo refined warehouse-native approaches; Bayesian methods became standard alongside frequentist techniques. Real-world deployments generated documented ROI: Discovery Communications achieved 6% video engagement lift; ComScore reached 69% lead generation increase. Methodological research advanced (sample size, prior selection challenges), yet the bifurcation persisted—technology leaders operationalized sophisticated testing while most enterprises faced adoption barriers despite readily available platforms.
2025-Q2: A/B testing vendor consolidation accelerated with Datadog's $220M Eppo acquisition, signaling industry convergence on integrated infrastructure platforms. Statsig achieved a $1.1B valuation with a $100M Series C raise, validating warehouse-native experimentation at scale. Methodological advancement continued: Harvard/Netflix research demonstrated anytime-valid inference enabling sample-size reduction via continuous monitoring; LinkedIn production deployments validated doubly robust statistical methods for non-Gaussian distributions. Yet implementation fragility remained critical: a CRO agency analysis of 7,200 tests found 72% of first experiments contained mistakes, with a worst-case documented 42% annual revenue loss from false-positive deployment, a signal that platform sophistication remained decoupled from execution maturity. Multiple-testing pitfalls were highlighted (20 concurrent tests = 64% risk of at least one spurious significant result), reinforcing persistent methodological barriers despite better tools. Market adoption extended to 77% globally, yet execution remained shallow (60% run fewer than five tests monthly).
2025-Q3: A/B testing platforms matured at scale with sustained methodological innovation. Autotrader deployed a production Bayesian framework handling tens to hundreds of tests monthly, advancing practitioner adoption of advanced statistical methods. Named enterprise deployments validated ecosystem maturity: OpenAI scaled to hundreds of experiments across hundreds of millions of users; Notion increased from single-digit to 300+ quarterly experiments; Brex consolidated vendors for a 20% cost reduction. Yet critical analysis of 10,000+ real-world tests found only 12% deliver meaningful business impact, and practitioner-level failures persisted: cases like PostHog's social login test and DoorDash's attribution challenges revealed endemic implementation pitfalls even at sophisticated organizations. Implementation barriers remained structural: low-traffic startups faced insurmountable sample size requirements (1,254+ days for 5% detectable lift), and "best practice" guidance (short copy, personalization tactics) often reduced conversions. The bifurcation between platform capability and execution maturity remained the defining tension.
2025-Q4: A/B testing vendor consolidation accelerated with OpenAI's acquisition of Statsig (September 2025), following Datadog's $220M Eppo acquisition in May. Bayesian methodology achieved mainstream adoption: 58% of large organizations now prefer Bayesian over frequentist methods; enterprise Bayesian adoption increased 45% year-over-year. Methodological innovations matured: hierarchical Bayesian frameworks for AI agent testing (Parloa Labs) and anytime-valid inference enabling continuous monitoring (Harvard/Netflix research) advanced frontier techniques. Statistical efficiency improvements standardized across platforms, with 30-50% test speedup via variance reduction. Yet critical analysis exposed persistent misconceptions: the widespread belief that Bayesian methods allow unlimited peeking without false positive inflation was debunked by simulation evidence (80% false positive rate with frequent peeking). Market adoption reached 78% globally, but analysis of 10,000+ tests showed that only 12% delivered meaningful business impact, revealing that platform maturity and statistical sophistication had not translated to improved decision-making outcomes. The practice remained defined by structural asymmetry: sophisticated practitioners operationalized warehouse-native testing with real-time monitoring and advanced statistical methods; most organizations faced persistent practitioner-level methodological errors and sample-size limitations that platform features could not ameliorate.
2026-Jan: A/B testing platforms matured further with enterprise AI integration entering the market; HelloFresh achieved 60x speedup in Bayesian testing pipeline through computational optimization, enabling thousands of concurrent experiments at scale. However, critical analyses published in January 2026 reinforced structural execution barriers: (1) False positive risks persisted at advanced organizations (26.4% even at Microsoft, Booking.com, Google, Netflix—indicating endemic methodological failures); (2) Real-world case study documentation showed failure patterns at Etsy, Duolingo, Heap, SumAll, and Facebook with metrics-specific failures (60%+ false positive inflation from early stopping); (3) Vendor transparency emerged as concern: Google and Meta systematically misrepresented observational A/B tests as randomized experiments, undermining causal inference. Statsig published boundary analysis identifying four critical scenarios where A/B testing should not be applied: limited traffic, dynamic environments, complex changes, and high-stakes contexts. AI-powered platforms (A/Bee) entered market claiming 22% average lift, signaling integration of generative AI into hypothesis and variation generation. The bifurcation between platform sophistication and execution maturity remained the defining structural tension of the practice as it entered 2026.
2026-Feb: A/B testing adoption metrics solidified: independent proprietary data from 90+ e-commerce brands confirmed 36.3% win rate with median +1.88% conversion uplift at 42-day test duration, validating real-world deployment effectiveness. Yet adoption barriers persisted: competitive platform analysis exposed vendor lock-in concerns, opaque experimentation engines, and per-event pricing scaling that became prohibitive at enterprise scale. Practitioner research with major social platforms documented temporal decay failures on dynamic platforms (TikTok, Facebook, YouTube), showing traditional statistical A/B testing assumptions break down in real-time environments. Platform maturity continued advancing—Statsig and competitors documented customers running thousands of experiments annually with warehouse-native and cloud deployment options, yet the foundational structural tension remained unresolved: sophisticated platform tooling had not translated to improved execution maturity or decision quality at most organizations.
2026-Mar: A/B testing practice demonstrated consolidating maturity with focused refinement on structural execution barriers. Amazon Science published research addressing winner's curse bias in impact estimation; Convert.com data showed 54% of organizations now at strategic/transformative maturity (up from 35% in 2021), signaling practitioner progression. AI-driven testing gained adoption with multiple named deployments (Ubisoft, Grene, WorkZone) documenting conversion uplifts. Critical assessments persisted: domain-specific failures documented in AI products (latency unmeasured until post-rollout), support team scaling (54% test fatigue at scale), and B2B adaptations requiring extended duration (4-8 weeks). Open-source tooling (BigQuery A/B Analyzer) advanced statistical bias mitigation, addressing real-world analytics platform limitations. Market sizing at $1.43B (2026), $2.73B (2032) projects continued 11%+ CAGR, yet the defining tension remained: platform sophistication had not reduced endemic methodological failures (early stopping, multiple testing, novelty bias) that prevented execution maturity at most organizations.
2026-Apr: Platform consolidation advanced with Datadog's GA launch of Experiments integrating A/B testing with observability (APM + business metrics), while analysis of 6,899 ecommerce tests and Kohavi's expert commentary (Microsoft 33% success vs. industry median 10%) reinforced persistent execution gaps. Research on 2,101 Optimizely experiments confirmed ~57% of practitioners p-hack, inflating false discovery from 33% to 42%; a new structural limitation emerged with AI-driven search traffic (14.2% vs. Google 2.8% CVR) breaking representativeness assumptions in dynamic environments. Uber's Experimentation Platform (XP) documented 1,000+ simultaneous experiments using SPRT, causal inference, and multi-armed bandits as the current gold standard — yet Uber's earlier Morpheus platform post-mortem revealed that "large percentage of experiments had fatal problems," illustrating that platform correctness failures at scale remain an unsolved engineering challenge. Spotify's analysis of 1,300 production experiments found a 22.6% false positive rate with five metrics uncorrected, and Amazon Science research on adaptive experimentation exposed non-stationarity as a practical limitation breaking adaptive method guarantees. Foundry CRO's industry-wide benchmarks sharpened the adoption-execution gap: 77% of companies claim A/B testing but less than 0.2% actively experiment, with only 36.3% of active testers achieving statistically significant wins (median +1.88% uplift); AI-assisted teams ran 4.7x more experiments per quarter, signaling where velocity gains concentrate.
2026-May: Platform methodology commoditization accelerated with Optimizely shipping contextual MABs, global holdouts, and MCP server integration enabling AI-driven test design; Spotify's warehouse-native Confidence platform documented 10,000+ experiments/year at 750M users with CUPED variance reduction and 42% guardrail-driven rollbacks, setting the current infrastructure benchmark. A DoorDash case study on A/B testing AI systems exposed a new class of execution fragility: models with good test performance showed 4.3% accuracy drops in production due to stochastic output variation, while Kameleoon adoption data confirmed that 84% of marketers test monthly but only 33.5% achieve statistical significance — the platform-execution gap remains structurally intact. Datadog launched its Experiments platform to GA (powered by the Eppo acquisition), integrating A/B testing with observability guardrails; Wikimedia Foundation deployed GrowthBook with documented auto-stop configuration (Clear Signals vs Do No Harm thresholds); and Amazon Science published two methodological advances addressing non-stationarity and Bayesian early termination — reinforcing that the research frontier continues advancing while DoorDash's 12,000+ experiments/year at 42M MAU sets the operational benchmark.
2026-Jun: Vendor automation advanced with Optimizely's Opal framework shipping specialized AI agents for QBR generation, value estimation, and backlog prioritization — the clearest signal yet of platform evolution toward orchestration automation. Google's fleet-wide A/B infrastructure case study documented production-grade deterministic assignment, causal-inference exposure logging, overlap management, and safety guardrails across interconnected global services; Statsig named-deployment evidence confirmed adoption scale (Notion 30x experimentation velocity, Ancestry 9x, Brex 50% data-scientist time savings). New failure modes surfaced for AI system testing: embedding model drift between test windows, feature flag leakage into model inputs, and shared memory contamination across variants document confounds that violate classical experimental assumptions, while mobile environments continue to present structural barriers — achieving statistical validity on a 3.2% baseline CVR requires 50,000+ committed users. GetYourGuide's sequential testing deployment achieved 40% experiment cycle reduction; Shopify's portfolio of 36 live winners across 1,000+ Plus-tier stores documented $2.3M+ monthly aggregate revenue lift, confirming that incremental real-world gains continue compounding at scale. Mid-June research advances: Persson et al. (arXiv) develop statistical foundations for LLM-based A/B testing, finding that LLM-only predictions recover only 39% of human treatment effects, with nonparametric calibration required for validity—critical methodological constraint for AI-powered hypothesis evaluation. Yu Zhang et al. address CUPED methodological subtleties in complex scenarios (multi-arm tests, two-stage designs), with findings deployed in ByteDance's production platform serving 1,000+ concurrent tests. 
Google Ads shipped structured asset A/B testing (June 11) for Performance Max, addressing the creative optimization black box problem with standardized experimentation framework. Wanted Lab (Korea's largest recruitment platform) achieved 150% sign-up conversion increase through formalized experimentation culture and data democratization, demonstrating that process maturity (not tool selection) drives real-world deployment success. Emerging constraint: traditional A/B testing methodology breaks for AI systems due to output non-determinism; practitioners must sequence offline evaluations before production tests and treat variants as parameterized systems rather than fixed treatments. AI agent testing (distinct from feature A/B testing) requires 10K+ interactions per variant and gold-set validation (200–500 curated examples) to overcome stochasticity variance.
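Two multiple-testing figures quoted in the timeline (64% for 20 concurrent tests and 22.6% for five uncorrected metrics) follow from the family-wise error rate for independent comparisons. The sketch below reproduces them, assuming each comparison uses a 0.05 significance level and that the comparisons are independent; both are assumptions, and correlated metrics would give a somewhat lower rate.

```python
def familywise_error_rate(num_comparisons, alpha=0.05):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** num_comparisons


def bonferroni_alpha(num_comparisons, alpha=0.05):
    """Per-comparison threshold that caps the family-wise rate at alpha."""
    return alpha / num_comparisons


for k in (1, 5, 20):
    print(
        f"{k:>2} comparisons: "
        f"family-wise error {familywise_error_rate(k):.1%}, "
        f"Bonferroni threshold {bonferroni_alpha(k):.4f}"
    )
```

Bonferroni is the simplest correction and is conservative; it is shown here only to illustrate how the per-comparison threshold must tighten as more metrics or variants are examined.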