The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.
Tools and processes for detecting AI-generated hallucinations and assessing the factual accuracy of model outputs. Includes automated fact-grounding and source verification; distinct from fact-checking in research, which verifies human-authored rather than AI-generated claims.
Hallucination detection has reached the point where forward-leaning organisations are deploying it in production, but most have not started, and the gap between vendor claims and real-world reliability remains the practice's defining tension. Detection and factuality assessment techniques aim to identify when LLMs generate plausible-sounding false claims. Generally available (GA) tooling now exists from AWS, Vectara, Datadog, and Microsoft, yet independent research consistently shows that benchmark accuracy does not predict production performance: benchmarks use artificially short prompts and simple contexts, while real systems operate at two to five times that complexity. The result is a field where platform maturity has outpaced measurement credibility. No single detection method generalises across languages, domains, and reasoning types, and leading models still hallucinate at baseline rates of 15-16% on realistic evaluations. Detection is a necessary governance control, but it functions as one layer in a stack that still requires human oversight and external grounding, not as a standalone solution.
The vanguard of adoption centres on a handful of named deployments across multiple sectors. DoorDash fields hundreds of thousands of AI-enabled customer calls daily using RAG with hallucination detection; AWS Bedrock trials block 75% of unsupported answers; and Vectara's HHEM model has surpassed 100,000 downloads, offering real-time factual consistency scoring. Enterprise adopters are beginning to integrate hallucination risk scores as formal QA gates in production pipelines, with early reports of 40% fewer escalations. Real-world implementations span financial crime (Unit21's 500,000+ verified alert reviews using layered detection), legal research (Citation Grounding achieving 98.5% validation accuracy on commercial models via fine-tuned DPO), and customer service (RichPanel's four-layer defense across 2,000+ enterprise deployments achieving sub-1% hallucination). Multi-model verification architectures show measurable impact: a June 2026 study across 480 million AI outputs found that cross-model verification reduces hallucination from 8.3% to 3.2% (a 61% relative reduction) in legal, financial, and healthcare deployments. These deployments demonstrate that detection is operationally viable in bounded domains, but they remain the frontier, not the field.
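One way to picture a hallucination risk score acting as a QA gate is the Python sketch below. It is illustrative only and rests on assumptions the text does not supply: the names `qa_gate`, `GateDecision`, and `overlap_risk`, the verifier signature, and the thresholds (0.3, 0.7, 0.5) are all hypothetical, and the toy word-overlap heuristic merely stands in for a trained scorer or a second model acting as verifier.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence, Tuple

# Hypothetical verifier interface: (context, answer) -> risk in [0, 1],
# where a higher value means the answer looks less supported by the context.
Verifier = Callable[[str, str], float]


@dataclass(frozen=True)
class GateDecision:
    action: str                # "release", "human_review", or "block"
    risk: float                # mean risk across verifiers
    scores: Tuple[float, ...]  # per-verifier scores, kept for audit logs


def qa_gate(context: str, answer: str, verifiers: Sequence[Verifier],
            review_at: float = 0.3, block_at: float = 0.7,
            max_spread: float = 0.5) -> GateDecision:
    """Route an answer using risk scores from one or more verifiers.

    All thresholds are illustrative assumptions, not vendor defaults.
    """
    if not verifiers:
        raise ValueError("at least one verifier is required")
    scores = tuple(v(context, answer) for v in verifiers)
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("verifier scores must lie in [0, 1]")
    risk = mean(scores)
    spread = max(scores) - min(scores)
    if risk >= block_at:
        action = "block"
    elif risk >= review_at or spread >= max_spread:
        # Disagreement between verifiers is treated as a reason to escalate.
        action = "human_review"
    else:
        action = "release"
    return GateDecision(action, risk, scores)


def overlap_risk(context: str, answer: str) -> float:
    """Toy stand-in for a real scorer: share of answer words absent from the context."""
    context_words = set(context.lower().split())
    answer_words = answer.lower().split()
    if not answer_words:
        return 1.0
    missing = sum(1 for w in answer_words if w not in context_words)
    return missing / len(answer_words)


if __name__ == "__main__":
    ctx = "the refund window is 30 days"
    print(qa_gate(ctx, "the refund window is 30 days", [overlap_risk]).action)
    print(qa_gate(ctx, "refunds take 90 weeks", [overlap_risk]).action)
```

The middle tier is the design choice worth noting: rather than a binary pass or fail, uncertain or contested answers are routed to a person, which keeps detection as one layer in the stack rather than the final arbiter.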
The research frontier has expanded beyond text. Amazon Science's VADE detects hallucinations in vision-language models via attention maps; ACL 2026 introduced HalluAudio, the first large-scale audio hallucination benchmark, spanning speech, environmental sound, and music with 5K+ human-verified pairs. Financial long-context detection advanced with NeurIPS 2025's PHANTOM benchmark, which tests hallucination detection on SEC filings of 500 to 30K tokens and reveals that small models still score near-random and that Lost-in-the-Middle degradation remains unresolved at scale. Amazon Science also published real-time detection methods for speech LLMs using attention-derived metrics, extending detection to the spoken-word domain. Multimodal expansion signals the field's coming-of-age, but each modality surfaces new measurement challenges.
The credibility problem runs deeper than any single vendor. AWS claims 99% verification accuracy for Bedrock Automated Reasoning; Vectara reports sub-1% hallucination rates for small models. Independent evaluations tell a different story. Frontier models expose a benchmark-accuracy paradox: GPT-5.5 tops every benchmark yet hallucinates at 86% on independent evaluation, a critical negative signal that accuracy-only metrics structurally reward confident guessing over calibrated uncertainty. The paradox extends to reasoning models: deployment data shows reasoning variants (o3, o4-mini, reasoning-focused DeepSeek) hallucinate at 33-48% despite their advanced capabilities, a 2-3× multiplier versus base models, which indicates that encouraging models to "think harder" amplifies rather than reduces hallucination. Cross-lingual evaluations show hallucination rates 15-35% higher in non-English languages, exposing that detection methods trained on English corpora fail systematically in low-resource language contexts. Peer-reviewed EACL 2026 research (April) confirms that detection techniques measure consistency rather than correctness (with 50%+ inconsistency in benchmarks such as Med-HALT), a fundamental misalignment with what practitioners need. The AuthenHallu benchmark reveals only 60% detection accuracy, even for SOTA models, on authentic LLM-human interactions. Detection also degrades where enterprises need it most: grounding evaluation breaks down in multi-turn agentic workflows (the HalluHard benchmark shows 30%+ hallucination even with web search; new CHARM framework research identifies cascading hallucinations as a distinct failure mode that output-level detection misses by a 70% margin), and no method yet handles the intersection of multilingual, multi-domain, and extended-context inputs.
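The gap between consistency and correctness can be made concrete with a toy example. The sketch below is not the method of the EACL paper or of any benchmark named above: `consistency_score` and `majority_is_correct` are hypothetical helpers, the sample answers are invented, and exact string matching stands in for the semantic comparison a real detector would need.

```python
from collections import Counter
from typing import List


def _normalise(samples: List[str]) -> List[str]:
    """Lower-case and trim answers so trivial formatting differences do not count."""
    return [s.strip().lower() for s in samples]


def consistency_score(samples: List[str]) -> float:
    """Share of sampled answers that agree with the most common answer."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(_normalise(samples))
    return counts.most_common(1)[0][1] / len(samples)


def majority_is_correct(samples: List[str], reference: str) -> bool:
    """Whether the most common sampled answer matches a trusted reference."""
    if not samples:
        raise ValueError("need at least one sample")
    majority = Counter(_normalise(samples)).most_common(1)[0][0]
    return majority == reference.strip().lower()


if __name__ == "__main__":
    reference = "42"                                 # invented ground truth
    confidently_wrong = ["41", "41", "41", "41", "41"]
    wobbly_but_right = ["42", "42", "41", "42", "43"]

    for name, samples in [("confidently wrong", confidently_wrong),
                          ("wobbly but right", wobbly_but_right)]:
        print(name,
              "consistency:", consistency_score(samples),
              "correct:", majority_is_correct(samples, reference))
```

A detector that flags only answers whose consistency falls below some cut-off (0.8, say) would pass the first set and flag the second, the inverse of what a practitioner needs. Checking correctness requires the external reference, which is why grounding and human review remain part of the stack.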
A stark real-world signal: a June 2026 audit of 2.5 million academic papers found 146,000+ hallucinated citations published in 2025, with 85.3% surviving peer review despite detection infrastructure. The audit documents governance failure at scale, even in communities with strong verification norms. Detection methods optimize for detectable consistency while missing correctness gaps, making deployment without mandatory human oversight unsafe in high-stakes domains. The field has reached an inflection point: detection is generally available and necessary, but fundamentally insufficient without human oversight and governance layers.
— Large-scale empirical study of 480 million AI outputs across legal, financial, and healthcare deployments showing multi-model verification reduces hallucination from 8.3% to 3.2% with Claude Opus 4.7 and Gemini 3.1 Pro achieving lowest error rates.
— Critical assessment documenting benchmark-production gap: GPT-5.5 shows 86% hallucination rate on independent evaluation despite headline improvements; identifies reasoning model paradox where advanced models hallucinate 2-3× more than base models.
— Framework formalizing cascading hallucinations as distinct agentic failure mode, achieving 89.4% cascade detection with 82.1% error propagation reduction versus 18.5% for output-level detectors, addressing multi-step workflow gaps.
— Domain-specific legal hallucination detection evaluated on AWS Bedrock (Claude, Mistral, Amazon Nova) revealing 13-21% citation hallucination rates with 98.5% validation accuracy through fine-tuned Citation Grounding DPO framework.
— Multi-institutional audit documenting 146K+ hallucinated citations across 2.5M academic papers with detection failure at scale—85.3% of hallucinations survived peer review, critical negative signal on governance effectiveness.
— Production technique evaluation from 500,000+ financial crime alert reviews: eval sets, deterministic code generation, context engineering, and safety nets achieve measurable accuracy in regulated high-stakes domain.
— STAR Protocols peer-reviewed ensemble voting method with RAG achieving 76.85% zero-hallucination rate on 10,000+ medical terminology tests—validated detection technique with reproducible measurement and zero false positives.
— First multi-turn financial hallucination detection benchmark in non-English domain revealing persistent refusal-behavior gap—even frontier models struggle with fine-grained financial diagnostics despite high binary detection F1 scores.