The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.
A daily newsletter distilling the past two weeks of movement in a domain or two — delivered to your inbox while the index updates in the background.
Each dot marks the weighted maturity of practices within a domain — hover for a brief summary, click for more detail
AI that helps individuals structure decisions, evaluate options, and apply reasoning frameworks to complex choices. Includes decision matrix generation and pro/con analysis; distinct from feature prioritisation, which applies frameworks to product decisions rather than to general personal choices.
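For concreteness, the kind of framework this domain covers is often nothing more exotic than a weighted decision matrix: score each option against each criterion, weight the criteria by importance, and rank by weighted total. The sketch below is a minimal, hypothetical illustration (the options, criteria, weights, and scores are all invented); in practice the AI assistant's contribution is proposing the criteria and weights, which is exactly where the reliability questions discussed below arise.

```python
# Minimal weighted decision matrix, purely illustrative; the options,
# criteria, weights, and scores are invented for the example.
CRITERIA_WEIGHTS = {"cost": 0.40, "commute": 0.35, "space": 0.25}  # weights sum to 1.0

OPTIONS = {
    "city flat":    {"cost": 4, "commute": 9, "space": 3},  # scores on a 1-10 scale
    "suburb house": {"cost": 7, "commute": 4, "space": 8},
}

def weighted_score(scores: dict) -> float:
    """Importance-weighted sum of an option's criterion scores."""
    return sum(CRITERIA_WEIGHTS[c] * s for c, s in scores.items())

# Rank options from highest to lowest weighted score.
for option, scores in sorted(OPTIONS.items(), key=lambda kv: weighted_score(kv[1]), reverse=True):
    print(f"{option}: {weighted_score(scores):.2f}")
```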
AI-assisted decision support has proven it can work in narrow, data-rich contexts, but it cannot yet be trusted broadly. That tension defines its bleeding-edge status. A handful of production deployments show genuine value in financial services orchestration, targeted healthcare screening, and marketing decisioning. Yet across 2,400+ enterprise AI initiatives, the failure rate sits at 80%, and 95% of generative AI pilots never reach production. The tools exist; reliable operationalisation does not. Fundamental LLM reasoning flaws are now mechanically understood: working memory saturates at 20-30 parallel branches regardless of model size; semantic variants trigger 28-45% answer-flip rates; chain-of-thought explanations are unfaithful post-hoc narratives rather than reasoning logs. These are not gaps that close with parameter count; they are architectural constraints. Regulatory pressure is accelerating faster than the field's ability to respond, with CCPA automated-decision rules and EU AI Act enforcement creating compliance deadlines that most commercial solutions cannot yet meet transparently. Legal precedent is forming: courts are establishing that executives cannot delegate decision accountability to AI, requiring mandatory governance documentation and verified reasoning. For individuals seeking structured decision support, AI can generate useful frameworks and surface tradeoffs, but the gap between plausible output and dependable guidance remains wide.
The credible deployments share a pattern: tightly scoped problems with rich, structured data. Government agencies are operationalizing decision-support systems for emergency response (Japan's tsunami impact modeling delivered in seconds, Sweden's autonomous defibrillator drones cutting emergency response times by 3+ minutes). ARPIA's financial services platform moves from raw data to activated collection strategy in 13 minutes. causaLens reports 5x ROI at Johnson & Johnson and McCann Worldgroup. The UK Health Security Agency's AI-assisted TB screening holds 90% accuracy while cutting manual review workload by 85%. These are real results, but they are islands, not a continent.
Professional services adoption nearly doubled to 40% organisation-wide in 2026 from 22% in 2025, according to a Thomson Reuters survey of 1,500+ professionals across 27 countries. The measurement story is less encouraging: only 18% of those firms track ROI, and 40% report client confusion over AI-use policies. Adoption is outrunning accountability. Recent empirical research reveals mechanistic constraints: all tested LLMs hit working memory saturation at identical parallel-branch thresholds (20-30) regardless of model size, indicating an architectural ceiling rather than a solvable scaling problem. Semantic perturbations trigger 28-45% accuracy flips, meaning models lack robustness to input variation. Chain-of-thought explanations, the dominant approach for AI-assisted reasoning, are unfaithful: Claude 3.7 discloses its actual reasoning path only 25% of the time, suggesting explanations are post-hoc cover stories rather than records of the reasoning itself. Commercial clinical decision-support systems compound the problem: a peer-reviewed scan of available AI-CDSS solutions found critical transparency gaps in training data provenance, algorithmic methods, and privacy compliance, blocking reliable evaluation before deployment. Legal precedent is now setting guardrails: the Australian Federal Court ruled that executives using public AI tools informally without verification breach their duty of care, and multiple jurisdictions are establishing that decision accountability cannot be delegated to AI, requiring mandatory documentation of reasoning review and source verification. Regulatory deadlines (California's CCPA automated-decision rules already in effect, EU AI Act enforcement arriving in August 2026) are forcing governance onto the roadmap whether organisations are ready or not.
— Survey (72 leaders, 30+ industries) shows 96% maintain human-in-the-loop on consequential decisions; 61% cite workflow redesign as primary enabler; governance framework is load-bearing for production deployment, not optional overhead.
— 18-month empirical implementation study develops 6-module governance framework; reveals clinical AI systems remain confined to pilots not due to model limitations but to institutional decision-making and governance capacity gaps.
— Analysis of AI overconfidence as a design choice: RLHF incentivizes confidence without uncertainty; expected calibration error (ECE) of 0.726 against 23% accuracy (see the calibration sketch after this list); standard calibration techniques reduce the error by 90% but are not deployed, a systemic design bias against decision-support reliability.
— Research shows 10-minute AI assistance impairs independent problem-solving through cognitive offloading; reduced retention and analytical skepticism; unintended consequence: outsourcing thinking erodes decision autonomy and agency.
— Empirical testing across 11 frontier models (67,221 records) reveals 8 collapse under adversarial pressure with 30.2pp accuracy drops; Anthropic Constitutional AI near-immune, indicating alignment-specific stability required for decision-support reliability.
— <20% of enterprise AI pilots reach production due to missing trust infrastructure: explainability, audit trails, governance, liability clarity. The core barrier is not capability but missing decision-framework accountability structures.
— Analysis of 5.5M real-world interactions shows top models fail ~9% overall, 14-16% on expert decision-making tasks (finance/law/medical); performance improvement flatlined despite compute scaling, exposing adoption barriers.
— Wharton RCT (1,300+ participants) shows AI-assisted decisions improve accuracy by 25pp when the AI is correct but degrade it by 15pp when the AI is wrong; overconfidence persists even at a 50% error rate, demonstrating cognitive-surrender risk in decision frameworks.
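The calibration figure cited above is worth unpacking. Expected calibration error (ECE) measures the gap between a model's stated confidence and its observed accuracy; an ECE of 0.726 alongside 23% accuracy describes a model asserting near-certainty while being wrong most of the time. Below is a minimal sketch of the standard binned ECE computation; the model outputs and bin count are invented for illustration.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: size-weighted average gap between mean confidence and accuracy per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.sum() / len(confidences) * gap
    return ece

# Hypothetical model: ~95% stated confidence, right ~23% of the time,
# which lands near the 0.726 figure cited in the evidence above.
stated_confidence = np.full(1000, 0.95)
was_correct = np.random.default_rng(0).random(1000) < 0.23
print(round(expected_calibration_error(stated_confidence, was_correct), 2))  # ~0.72
```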
2022-H2: First identified research surge in AI reasoning benchmarks (temporal, step-by-step, knowledge-graph) and human-in-the-loop frameworks; major failures in practice (Zillow, model brittleness); ~85% enterprise project failure rate documented; theoretical work on reasoning fallibility and the need for AI to express uncertainty.
2023-H1: Research focus shifted to human-AI interaction challenges: overreliance on AI suggestions despite explanations, user dropout when AI feedback is unhelpful, and widespread concerns about accountability and risk. Evidence of adoption barriers in personal decision support remained dominant; no large-scale personal reasoning framework deployments documented.
2023-H2: Research concentrated on three critical areas: (1) human-centered design frameworks for DSSs (PAAI questionnaire with 700+ participant validation), (2) trust and accountability barriers blocking clinical deployment (liability concerns, accuracy standards), and (3) underlying AI reasoning capabilities (foundation model survey). Evaluation gaps documented—despite renewed interest, empirical evidence on AI-CDS effectiveness remained scarce. Governance and risk analysis highlighted over-reliance, bias, and dynamic environment brittleness as core failure modes.
2024-Q1: Adoption accelerated dramatically: ~90% of enterprises deployed AI for autonomous decision-making. Simultaneously, fundamental technical limitations became clearer—Apple research confirmed critical reasoning flaws in LLMs (GSM-Symbolic benchmark), and AI systems scored only 30% on novel reasoning tasks (ARC). Bias in operational AI decision systems documented in education. Revenue impact quantified: 6% average annual loss from underperforming models. The execution gap widened: implementations failed at generalization, bias mitigation, and reliability despite widespread organizational trust. Empirical evidence on decision support effectiveness remained sparse, leaving large-scale deployments without measured impact validation.
2024-Q2: Real-world deployment evidence emerged, revealing persistent failures despite adoption breadth. A Dutch court case study examined AI decision-support in legal proceedings, while a Harvard-led RCT in Wisconsin courts found AI recommendations failed to improve bail decisions and judges rejected them 30%+ of the time. Experimental research with 1,403 participants confirmed overreliance remains endemic despite explainability efforts: workers align with biased AI recommendations up to 90% of the time in hiring contexts. A 600-executive survey found 48% of AI projects paused or rolled back due to privacy, regulatory, and integration challenges. The landscape shifted from "can we build AI decision systems" to "why are deployed systems failing to improve decisions"; adoption at scale masked persistent technical and organizational gaps.
2024-Q3: Research clarified three persistent barriers to effective human-AI decision-making: achieving complementarity, managing human mental models, and design choices that prevent cognitive overload. Healthcare case studies identified prerequisite frameworks for responsible AI-DSS (bias mitigation, human-centric learning loops, incremental trust-building). Critical assessments documented permanent risks in high-stakes decision contexts (military targeting, legal proceedings) due to hallucinations, brittleness, and inability to ensure regulatory compliance. Analyst forecasts predicted 30% project abandonment post-proof-of-concept by end of 2025, with organizations struggling to realize value despite major investments. The field continued to reconcile widespread enterprise adoption with persistent deployment failures, unresolved bias risks, and absence of clear impact metrics on decision quality.
2024-Q4: Critical research published on technical and organizational barriers to reliable AI reasoning. New findings revealed overreliance persists in clinical decision-making despite trust calibration efforts, with physicians exhibiting diagnostic errors from AI misalignment. Healthcare professionals identified systemic ethical concerns (bias, transparency gaps, accountability deficits) in AI-CDSS deployments. Fundamental research showed state-of-the-art reasoning models exhibit overthinking and discard correct reasoning paths, undermining the assumption that larger models improve decision support. Legal sector adoption continued despite significant concerns: 43% of legal professionals observed bias, 37% feared unreliability. Clinical adoption surveys showed 76% of physicians now use LLMs for decisions yet 97% vet outputs, indicating cautious rather than confident deployment. By year-end 2024, the field had reached consensus that the core challenge is not building reasoning systems but deploying them safely and measurably—technical limitations in AI reasoning were well-documented, but practical implementation remained the bottleneck. Organizations continued investing despite unresolved risks, suggesting adoption momentum has decoupled from evidence of effectiveness.
2025-Q1: Enterprise adoption continued but real-world reliability challenges intensified. UK government abandoned multiple welfare-system AI pilots (A-cubed, Aigent) due to scalability and reliability concerns—explicit signal of deployment failure in public sector decision support. Healthcare outcomes improved in targeted deployments: UK Health Security Agency achieved 90% accuracy in TB screening with 85% reduction in manual review workload. Research on adoption barriers showed 450 physicians in China identified multiple adoption pathways depending on hospital type and organizational context. Critical research revealed that human oversight alone is insufficient to prevent discrimination: EU study found human decision-makers equally likely to follow biased AI recommendations regardless of fairness-algorithm design. Fundamental reasoning limitations persisted: AI reasoning models continued exhibiting data bias, lack of common sense, and transparency failures that undermine high-stakes decision-making. The gap between pilot success and production scaling widened: isolated cases showed operational gains, but public sector abandonment and persistent bias findings suggested the field remained pre-scale.
2025-Q2: Evidence revealed critical implementation gaps despite continued investment. Dermatology study (223 physicians) found AI support yielded only 1% accuracy improvement with low reliance (10%), indicating adoption barriers persist even in favorable clinical contexts. Defense deployments (Project Maven, UK autonomous targeting, Iron Dome) demonstrated real-world AI-DSS use but in high-stakes, tightly constrained settings. ChatGPT testing showed AI mirrors human decision-making biases including overconfidence and gambler's fallacy in half of scenarios, suggesting AI amplifies cognitive flaws rather than mitigating them. Expert Delphi consensus identified 34 critical implementation factors for healthcare AI-DSS, yet organizational capacity to execute remained limited. Industry analysis showed only 26% of companies have working AI products and 4% achieve significant ROI; Gartner predicted 40%+ project cancellations by 2027 due to unclear value and costs. Parallel evidence of high adoption breadth (93% of leaders report GenAI competitive benefits) masked low implementation depth and persistent execution challenges.
2025-Q3: Research clarified fundamental and persistent technical limitations in AI reasoning: models performed no better than humans on novel problems and replicated cognitive biases including overconfidence. MIT analysis of 300 deployments found 95% of AI pilots failed to deliver value, with vendor solutions succeeding ~67% versus internal builds 33%—exposing both adoption and execution challenges. Consumer trust surveys (YouGov, 10K respondents) showed 52% comfort with AI for daily personal decisions but only 39% for financial decisions; humans retained override preference in 55%+ of scenarios. New tools for bias detection (CMU AIR) and structured decision frameworks (MCDM-based ModelSelect with 50 case-study validation) promised incremental rigor improvements yet could not address fundamental reasoning limitations. Research documented that AI actively degraded decision quality: executives using generative AI made worse forecasts than without it, highlighting the risk of overconfidence in AI-enhanced reasoning. The field remained characterized by adoption momentum decoupled from evidence of effectiveness, with organizations continuing heavy investment despite quantified failures and persistent technical barriers.
2025-Q4: Deployment evidence revealed domain-specific outcomes: IBM achieved $4.5B productivity impact from agentic AI deployed to 270K employees; marketing decision-intelligence platforms reached 26-75% adoption with measurable ROI; UK Health Security Agency's AI-assisted TB screening achieved 90% accuracy. Yet critical limitations emerged across high-stakes domains: medical data gaps (EMR design flaws, not algorithmic limitations) constrained clinical decision-support impact; Indian judges warned of AI-fabricated legal judgments and hallucinations; government pilots stalled due to scaling and budget challenges; legal professionals documented persistent bias (43% observing bias, 37% fearing unreliability). Medical educators flagged overreliance risks: GenAI tools threaten critical thinking skill development and reinforce training data biases. Technical advances in reasoning (GPT-5.1 integration, causal AI frameworks) continued, yet ethics scholars debated justified use of black-box AI in high-stakes domains. By year-end, the field had consolidated around differentiation by domain: operational value in narrow contexts (marketing, logistics) versus persistent barriers and documented risks in broader organizational and high-stakes deployment scenarios. Adoption momentum remained decoupled from evidence of effectiveness, with organizations continuing investment despite quantified failures and unresolved deployment barriers.
2026-Jan: Enterprise transition to operationalization emphasized data governance and architectural foundations; 62% of enterprises planning evolution to AI decision intelligence amid persistent 70-85% project failure rates and 42% initiative abandonment in 2025. Causal AI emerged as the next frontier, addressing the 74% faithfulness gap in existing systems. Clinical research documented error reduction (78% decline in guideline violations) through hybrid frameworks, yet deployment barriers remained: only 12% of executives reported both cost and revenue benefits; physician studies highlighted that reasoning cues must target high-discretion tasks where AI can add genuine value.
2026-Feb: Multi-AI orchestration demonstrated operational feasibility (ARPIA 13-min data-to-strategy pipeline, causaLens enterprise deployments); professional services adoption jumped to 40% (2025: 22%), yet only 18% track ROI and 40% report policy confusion. Systematic LLM reasoning failures (Reversal Curse, Robustness Fragility, Working Memory Leaks) documented, undermining reliance on AI reasoning chains. Commercial AI-CDSS solutions still lack transparent training data and algorithm disclosure. Regulatory deadlines (CCPA Jan 2026, EU AI Act Aug 2026) drove governance platform launches. Across 2,400+ enterprise AI initiatives, 80.3% failed (33.8% abandoned, 28.4% delivered no value), with 95% of GenAI pilots failing to reach production. Execution and governance remain bottlenecks, not capability.
2026-Mar: Fundamental research documented persistent reasoning failures: CRYSTAL benchmark shows models skip 50%+ of reasoning steps (58% accuracy but only 48% reasoning recovery); BrainBench reveals stochastic reasoning gaps (6-16pp consistency variance even in top models); Stanford taxonomy classifies failures as architectural rather than scale-addressable. Real-world failures documented: NZ courts ruled AI-hallucinated legal citations may amount to obstruction; Deloitte refunded AUD 440K for AI-generated errors. Governance frameworks consolidated: RegTech expert consensus establishes human accountability cannot be delegated; KPMG legal analysis requires mandatory documentation of decision review. Practical deployment barriers clarified: reasoning models show 5x cost premium with performance ceiling at medium-complexity tasks (above which accuracy collapses). Field consensus solidifying around decision-support constraints: execution challenges and governance requirements are primary blockers, not reasoning capability gaps.
2026-Apr: New empirical research confirmed architectural reasoning limits are scale-invariant: testing across 7 models (8B–235B parameters) showed all collapse at 20-30 parallel branches, and semantic variants of problems trigger 28-45% answer-flip rates, confirming brittleness is structural, not addressable by larger models. CMU testing of 14 leading LLMs (GPT-4, Claude 3, Gemini) found all fail simple logical contradiction detection, revealing that benchmark performance masks fundamental reasoning gaps. Harvard research added a new dimension: relational complexity causes accuracy to collapse when decisions require weighing multiple interacting factors simultaneously, directly constraining multi-factor analysis in healthcare and strategic contexts. Legal accountability frameworks tightened: the Australian Federal Court (ASIC v Bekier) established that executives using public AI tools informally without verification breach their duty of care, reinforcing that decision accountability cannot be delegated to AI and mandating documentation of reasoning review. Latest evidence on personal decision support reveals critical tensions: targeted healthcare deployments show measurable value (RCT with 367 participants: 7.4-point satisfaction improvement, 50.7% vs 24.2% acceptance of AI recommendations), yet passive reliance on AI reasoning systematically erodes confidence in independent judgment and sense of authorship (behavioral study, 1,923 adults). Empirical adoption barriers remain severe: only 9% of professionals trust AI for complex decisions despite 88% organizational adoption; 80% of workers reject enterprise AI tools (WalkMe, 3,750 professionals). Reasoning failure modes are now precisely characterized: hallucination rates span 22-94% across models (Stanford 2026 AI Index), with a documented "Reliability Gap" where capability scales 2-3x annually while reliability improves only 1.2-1.5x. Governance frameworks are consolidating: the SPEC framework achieves 89% accuracy vs 15% for unbounded RAG on incomplete-evidence scenarios by bounding AI confidence to evidential sufficiency; the CFA Institute articulates an epistemic anchoring principle that decision authority must remain in evidence-based human inquiry to avoid "knowledge-collapse equilibrium."
2026-May: Research intensified focus on cognitive and systemic failure modes in AI-assisted decision-making. Wharton's Cognitive Reflection Test (1,300+ participants) demonstrated the core paradox: AI-correct advice improves accuracy 25pp, but AI-wrong advice degrades 15pp below baseline—worse than no AI. Overconfidence persists even when users know AI errs 50% of the time, indicating cognitive surrender rather than rational reliance. Metacognitive stability revealed as critical: empirical testing across 11 frontier models shows 8 collapse under adversarial pressure (30.2pp accuracy drops), with only Anthropic's Constitutional AI showing near-immunity—suggesting alignment-specific training is prerequisite for trustworthy reasoning, not achievable through standard RLHF. Enterprise adoption gaps widened: <20% of AI pilots reach production due to missing trust infrastructure (audit trails, explainability, liability frameworks); yet production deployments in bounded domains (insurance claims, pharma R&D) show 50%+ efficiency gains when governance frameworks are load-bearing. Real-world interaction data (5.5M instances) reveals expert-domain performance plateaued at 14-16% dissatisfaction despite scale; raw models generate confident false theories when given deliberately flawed premises. Most critical finding: brief AI assistance (10 minutes) systematically impairs independent problem-solving through cognitive offloading, reducing retention and analytical skepticism—unintended consequence suggesting tool design actively erodes reasoning autonomy. Governance literature consolidated: AI system abandonment driven primarily by organizational dynamics and resource constraints, not ethics concerns; clinical AI remains confined to pilots not due to model limitations but institutional capacity gaps. The field's consensus strengthened: decision-support reliability requires orchestration of governance, human oversight, design discipline, and alignment-specific training—capability alone is necessary but insufficient.