The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.
A daily newsletter distilling the past two weeks of movement in a domain or two — delivered to your inbox while the index updates in the background.
Each dot marks the weighted maturity of practices within a domain — hover for a brief summary, click for more detail
Tools and processes for detecting AI-generated hallucinations and assessing the factual accuracy of model outputs. Includes automated fact-grounding and source verification; distinct from fact-checking in research, which verifies human-authored rather than AI-generated claims.
Hallucination detection has reached the point where forward-leaning organisations are deploying it in production, but most have not started — and the gap between vendor claims and real-world reliability remains the practice's defining tension. Detection and factuality assessment techniques aim to identify when LLMs generate plausible-sounding false claims. GA tooling now exists from AWS, Vectara, Datadog, and Microsoft, yet independent research consistently shows that benchmark accuracy does not predict production performance: benchmarks use artificially short prompts and simple contexts, while real systems operate at two to five times that complexity. The result is a field where platform maturity has outpaced measurement credibility. No single detection method generalises across languages, domains, and reasoning types, and leading models still show baseline hallucination rates of 15-16% on realistic evaluations. Detection is a necessary governance control, but it functions as one layer in a stack that still requires human oversight and external grounding — not a standalone solution.
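None of the vendor tools above share a single canonical API, so the sketch below shows only the underlying pattern in its simplest form: score a generated claim against its source context with an off-the-shelf NLI model and flag low entailment as potential hallucination. The checkpoint name and the 0.5 threshold are illustrative assumptions, not a reference to any product named above.

```python
# Minimal sketch of NLI-based factual-consistency scoring (an assumption-laden
# illustration of the general pattern, not any vendor's implementation).
from transformers import pipeline

# Illustrative checkpoint choice; any NLI model with an entailment label works.
nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

def consistency_score(source: str, claim: str) -> float:
    """Probability that the source context entails the generated claim."""
    scores = nli({"text": source, "text_pair": claim}, top_k=None)
    return next(s["score"] for s in scores if s["label"].upper().startswith("ENTAIL"))

source = "The invoice was issued on 3 March and is due within 30 days."
claim = "The invoice is due within 60 days."
if consistency_score(source, claim) < 0.5:  # threshold is an assumed value
    print("flag: claim not supported by the source context")
```

Production systems typically wrap a scorer like this with retrieval of the grounding context, per-sentence claim splitting, and calibration against a domain-specific validation set.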
The vanguard of adoption centres on a handful of named deployments. DoorDash fields hundreds of thousands of AI-enabled customer calls daily using RAG with hallucination detection; AWS Bedrock trials block 75% of unsupported answers; and Vectara's HHEM model has surpassed 100,000 downloads, offering real-time factual consistency scoring. Enterprise adopters are beginning to integrate hallucination risk scores as formal QA gates in production pipelines, with early reports of 40% fewer escalations. Real-world implementations like Qdrant's three-layer defence demonstrate a 94% hallucination reduction across five enterprise clients in production. These are meaningful signals, but they represent the frontier — not the field.
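The QA-gate pattern described above reduces to a small routing policy wrapped around whichever risk scorer an organisation trusts. The sketch below is a hypothetical illustration: the thresholds, field names, and actions are assumptions, and the scorer is passed in rather than named because vendors expose it differently.

```python
# Hypothetical hallucination-risk QA gate; thresholds and actions are
# illustrative assumptions, not any named vendor's defaults.
from dataclasses import dataclass
from typing import Callable

BLOCK_THRESHOLD = 0.7   # assumed: above this, never send the draft answer
REVIEW_THRESHOLD = 0.3  # assumed: above this, require citations / review

@dataclass
class GatedAnswer:
    text: str
    risk: float   # 0.0 = fully grounded, 1.0 = almost certainly hallucinated
    action: str   # "send" | "send_with_citations" | "escalate_to_human"

def gate(answer: str, context: str,
         score_risk: Callable[[str, str], float]) -> GatedAnswer:
    risk = score_risk(context, answer)
    if risk >= BLOCK_THRESHOLD:
        return GatedAnswer(answer, risk, "escalate_to_human")
    if risk >= REVIEW_THRESHOLD:
        return GatedAnswer(answer, risk, "send_with_citations")
    return GatedAnswer(answer, risk, "send")
```

Logging the risk score alongside the final action is what turns a gate like this into an auditable governance control rather than a silent filter.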
The research frontier has expanded beyond text. Amazon Science's VADE detects hallucinations in vision-language models via attention maps; ACL 2026 introduced HalluAudio, the first large-scale audio hallucination benchmark spanning speech, environmental sound, and music, with 5K+ human-verified pairs. Financial long-context detection advanced with NeurIPS 2025's PHANTOM benchmark, which tests hallucination detection on 500-30K token SEC filings and reveals that small models still score near-random and that Lost-in-the-Middle degradation remains unresolved at scale. Amazon Science also published real-time detection methods for speech LLMs using attention-derived metrics, extending detection to the spoken-word domain. Multimodal expansion signals the field's coming-of-age, but each modality surfaces new measurement challenges.
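The attention-derived metrics referenced here are defined in the respective papers; the sketch below shows only the general shape of the idea under our own assumptions: measure, per generated token, how much attention mass falls on the grounding input (audio frames, image patches, retrieved text) rather than on previously generated output, and treat low ratios as a hallucination signal. It is not the published AUDIORATIO definition.

```python
# Illustrative attention-mass ratio for grounding checks; the array layout,
# averaging scheme, and threshold are assumptions, not a published metric.
import numpy as np

def input_attention_ratio(attn: np.ndarray, input_len: int) -> np.ndarray:
    """attn: [layers, heads, gen_len, src_len] attention weights for each
    generated token over the full source sequence (grounding input first,
    then previously generated tokens). Returns, per generated token, the
    share of attention mass on the first `input_len` positions, averaged
    over layers and heads."""
    mass_on_input = attn[..., :input_len].sum(axis=-1)  # [layers, heads, gen_len]
    return mass_on_input.mean(axis=(0, 1))              # [gen_len]

# Toy example: 2 layers, 4 heads, 5 generated tokens, 20 source positions,
# of which the first 12 are the grounding input (e.g. audio frames).
rng = np.random.default_rng(0)
attn = rng.random((2, 4, 5, 20))
attn /= attn.sum(axis=-1, keepdims=True)                # normalise each row
ratios = input_attention_ratio(attn, input_len=12)
print(ratios.round(2), ratios < 0.4)                    # 0.4 is an assumed cut-off
```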
The credibility problem runs deeper than any single vendor. AWS claims 99% verification accuracy for Bedrock Automated Reasoning; Vectara reports sub-1% hallucination rates for small models. Independent evaluations tell a different story. Frontier models expose a benchmark-accuracy paradox: GPT-5.5 tops every benchmark yet hallucinates at 86%, a critical negative signal that accuracy-only metrics structurally reward confident guessing over calibrated uncertainty. Cross-lingual evaluations show hallucination rates 15-35% higher in non-English languages, evidence that detection methods trained on English corpora fail systematically in low-resource language contexts. Peer-reviewed EACL 2026 research from April confirms that detection techniques measure consistency (with 50%+ inconsistency in benchmarks like Med-HALT), not correctness, which is fundamentally misaligned with what practitioners need. The AuthenHallu benchmark finds only 60% detection accuracy for even SOTA models on authentic LLM-human interactions.

Detection also degrades where enterprises need it most: grounding evaluation breaks down in multi-turn agentic workflows (the HalluHard benchmark shows 30%+ hallucination even with web search), and no method yet handles the intersection of multilingual, multi-domain, and extended-context inputs. A stark real-world signal: in the ICLR 2026 integrity crisis, 50+ papers were accepted to a top venue despite containing hallucinated citations and fabricated datasets, a failure of detection governance at the level of research infrastructure. Detection methods optimise for detectable consistency while missing correctness gaps, making deployment without mandatory human oversight unsafe in high-stakes domains. The field has reached an inflection point: detection is generally available and necessary, but fundamentally insufficient without human oversight and governance layers.
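The distinction between accuracy and hallucination rate is easy to make concrete. The toy calculation below, with entirely invented numbers, scores two hypothetical models on the same question set: one that always answers and one that abstains when unsure. Accuracy alone ranks the guesser higher; tracking hallucination rate over attempted answers separately is what surfaces the calibration difference.

```python
# Why accuracy-only scoring rewards confident guessing: toy, invented numbers.
# "abstain" counts questions where the model declined to answer.
records = [
    {"model": "guesser", "correct": 70, "wrong": 30, "abstain": 0},
    {"model": "hedger",  "correct": 55, "wrong": 5,  "abstain": 40},
]

for r in records:
    total = r["correct"] + r["wrong"] + r["abstain"]
    attempted = r["correct"] + r["wrong"]
    accuracy = r["correct"] / total              # what leaderboards report
    hallucination_rate = r["wrong"] / attempted  # wrong among answers given
    print(f'{r["model"]:8s} accuracy={accuracy:.0%}  '
          f'hallucination={hallucination_rate:.0%}  '
          f'abstention={r["abstain"] / total:.0%}')
# guesser  accuracy=70%  hallucination=30%  abstention=0%
# hedger   accuracy=55%  hallucination=8%   abstention=40%
```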
— Amazon Science peer-reviewed inference-time hallucination detection for speech LLMs using attention-derived metrics (AUDIORATIO, AUDIOCONSISTENCY, AUDIOENTROPY), advancing real-time detection in spoken-word contexts.
— Critical assessment showing frontier model (GPT-5.5) achieves highest accuracy but 86% hallucination rate, with Nature paper evidence that accuracy-only benchmarks structurally reward confident guessing over calibrated uncertainty—negative signal on detection sufficiency.
— Practitioner analysis documenting production discrepancies: GPT-5.5 achieves 57% accuracy but 86% hallucination rate on same benchmark, showing accuracy and hallucination as separate dimensions requiring independent assessment.
— NeurIPS 2025 Datasets & Benchmarks track benchmark for hallucination detection on 5K+ SEC filing QA pairs at 500-30K token lengths, revealing Lost-in-the-Middle degradation and that small models score near-random—addressing a critical gap in long-context financial hallucination detection.
— ACL 2026 first large-scale audio hallucination benchmark (5K+ human-verified QA pairs) across speech, environmental sound, and music; systematic evaluation demonstrating that hallucination detection research now spans text, vision, and audio modalities.
— Practitioner analysis documenting hallucination rates 15-35% higher in non-English languages (38-point deficits in low-resource languages), exposing evaluation gaps in multilingual production systems.
— Real-world detection failure at scale: 50+ ICLR papers contained hallucinated citations and fabricated datasets despite peer review, documenting governance lessons and detection gaps applicable to enterprise systems.
— Amazon Science research presenting attention-map-based hallucination detection for Vision Language Models by identifying misalignment between outputs and visual content, extending detection methods to multimodal domain.